A solution accelerator for standardizing, centralizing, and ingesting Databricks cluster logs for enhanced log searching, troubleshooting, and optimization.
The goal of this repo is to provide platform administrators and data engineers with a reference implementation for scalable log ingestion, as well as broader guidance for practitioners on using logging in jobs.
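On the job side, any messages your code writes through a standard logger end up in the cluster's driver logs, which this accelerator ingests and makes searchable. As a minimal sketch of that guidance (Python standard library only; the logger name and messages are placeholder assumptions, not part of this repo):

```python
import logging

# A plain stdlib logger writes to the driver's stdout/stderr, which
# Databricks captures in the cluster logs that Watchtower ingests.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("my_job")

logger.info("Starting ingest for date=%s", "2024-01-01")
logger.error("Failed to read source table: %s", "permission denied")
```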
The included dashboard automatically detects common Spark issues without requiring manual log analysis. It provides pre-built pattern detectors for performance problems (OutOfMemory errors, spill to disk, lost executors) and privacy concerns (`df.show()` statements in logs), alongside flexible log search and Top N analysis capabilities. This lets you quickly identify and troubleshoot job failures by surfacing the most relevant issues at a glance.
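To give a feel for what these detectors look for, here is a minimal sketch of the pattern-matching idea (the regexes and function below are illustrative assumptions; the dashboard's actual detectors are queries over the ingested log tables):

```python
import re

# Illustrative patterns for the issue classes the dashboard surfaces.
DETECTORS = {
    "out_of_memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "spill_to_disk": re.compile(r"[Ss]pill(ing)?.* to disk"),
    "lost_executor": re.compile(r"Lost executor \d+"),
}

def detect_issues(line: str) -> list[str]:
    """Return the names of all detectors that match a log line."""
    return [name for name, rx in DETECTORS.items() if rx.search(line)]

print(detect_issues("ERROR Executor: java.lang.OutOfMemoryError: Java heap space"))
# -> ['out_of_memory']
```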
Before deploying this solution, ensure you have:
- Unity Catalog: A catalog named `watchtower` must be created in your workspace (or provide your own catalog name by modifying the `catalog` variable in `databricks.yml`)
- SQL Warehouse: A SQL warehouse must be available and referenced in `databricks.yml` (default: "Shared Unity Catalog Serverless")
- Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/install.html

- Authenticate to your Databricks workspace (if you have not done so already):

  ```
  $ databricks configure
  ```

- To deploy a development copy of this project, type:

  ```
  $ databricks bundle deploy --target dev
  ```

  (Note that "dev" is the default target, so the `--target` parameter is optional here.) This deploys everything that's defined for this project. For example, the default template would deploy a Lakeflow Pipeline called `[dev yourname] watchtower_pipeline` to your workspace. You can find that pipeline by opening your workspace and clicking on Jobs & Pipelines.

- Similarly, to deploy a production copy, configure a `prod` target in `databricks.yml` and type:

  ```
  $ databricks bundle deploy --target prod
  ```

- To deploy and run integration tests:

  ```
  $ databricks bundle deploy --target staging
  $ databricks bundle run integration_tests --target staging
  ```

- To run a job, use the "run" command:

  ```
  $ databricks bundle run [KEY]
  ```

- Optionally, install developer tools such as the Databricks extension for Visual Studio Code from https://docs.databricks.com/dev-tools/vscode-ext.html.

- For documentation on the Databricks Asset Bundles format used for this project, and for CI/CD configuration, see https://docs.databricks.com/dev-tools/bundles/index.html.
To completely tear down all resources that were deployed as part of this asset bundle:

```
databricks bundle destroy --target dev
```

If you also used the provided Terraform to deploy your Unity Catalog Volume resources, you can tear those down as well:

```
cd terraform && terraform destroy && cd ..
```

To run the linter by itself:

```
hatch run lint:check
```

To run the linter with automatic fixing (when possible):

```
hatch run lint:fix
```

Run all static analysis tools with one command:

```
hatch run analyze:all
```

To run unit tests:

```
hatch run test:test
```

To enable automatic testing on Pull Requests:
- Add GitHub Repository Secrets:
  - Go to your repo → Settings → Secrets and variables → Actions
  - Add secret: `DATABRICKS_TOKEN` (your Databricks token)
  - Optionally add variable: `DATABRICKS_HOST`

- What happens automatically:
  - Pull Requests: Validated and tested with isolated workspace paths
  - Main branch: Deployed to your dev environment
  - PR cleanup: Resources automatically cleaned up when a PR is closed
Deploying this bundle creates the following resources:

- Pipeline: `Databricks log ingestion pipeline`
- Workflow: `watchtower_demo_job`
- Dashboard: `Logs Dashboard` (deployed alongside the pipeline); includes log search with filtering, Top N analysis for errors and jobs, and pre-built detectors for common issues like OOM errors, spill to disk, lost executors, and privacy concerns
- Location: `/Workspace/Users/your-email@company.com/.bundle/watchtower/`
Common commands at a glance:

```
cd terraform && terraform init && terraform apply && cd ..  # Deploy Catalog resources and init scripts
databricks bundle validate                                  # Check configuration
databricks bundle deploy                                    # Deploy to workspace
databricks bundle run demo_workflow                         # Run the demo workflow
databricks bundle summary                                   # See what's deployed
databricks bundle destroy                                   # Remove everything
cd terraform && terraform destroy && cd ..
```

Watchtower automatically manages the cleanup of processed raw log files using Databricks Auto Loader's `cleanSource` feature. This helps control storage costs and meet data retention compliance requirements.
You can customize the retention behavior using these variables in `databricks.yml`:

- `raw_log_retention_action` (default: `DELETE`)
  - Valid values: `DELETE`, `MOVE`, `OFF`
    - `DELETE`: Permanently removes processed files after the retention period
    - `MOVE`: Archives processed files to a cold storage location
    - `OFF`: Disables automatic cleanup (files are never deleted)
- `raw_log_retention_duration` (default: `30 days`)
  - How long to retain raw log files before cleanup
  - Format: `<number> <unit>` (e.g., `7 days`, `90 days`, `1 month`)
- `raw_log_retention_move_destination` (default: `""`)
  - Destination path when using the `MOVE` action
  - Leave empty when using the `DELETE` action
  - Example: `/Volumes/watchtower/default/archived_logs/`
Keep logs for 7 days, then delete:

```yaml
variables:
  raw_log_retention_action: DELETE
  raw_log_retention_duration: 7 days
```

Archive logs to cold storage after 90 days:

```yaml
variables:
  raw_log_retention_action: MOVE
  raw_log_retention_duration: 90 days
  raw_log_retention_move_destination: /Volumes/watchtower/default/archived_logs/
```

Disable automatic cleanup:

```yaml
variables:
  raw_log_retention_action: OFF
```

Note: This feature was highlighted in the Databricks blog on scalable logging under Operational Considerations. While you can also implement retention using cloud provider lifecycle rules (S3/ADLS/GCS), this built-in approach is simpler and doesn't require additional cloud configuration.
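For reference, here is a rough sketch of how these settings map onto Auto Loader options inside the ingestion stream (option names follow the Auto Loader documentation; the paths and file format are placeholder assumptions, not necessarily what this pipeline uses):

```python
# Runs inside a Databricks pipeline or notebook where `spark` is in scope.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")
    # Mirrors raw_log_retention_action: DELETE, MOVE, or OFF.
    .option("cloudFiles.cleanSource", "MOVE")
    # Mirrors raw_log_retention_duration.
    .option("cloudFiles.cleanSource.retentionDuration", "90 days")
    # Only needed for MOVE; mirrors raw_log_retention_move_destination.
    .option("cloudFiles.cleanSource.moveDestination",
            "/Volumes/watchtower/default/archived_logs/")
    .load("/Volumes/watchtower/default/raw_logs/")
)
```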
To customize this project for your environment:

- Update `databricks.yml` or the `resources/*.yml` files with your job/pipeline settings
- Modify `src/watchtower/log_ingest_pipeline.py` to customize the log ingestion pipeline (see the sketch after this list)
- Modify the workspace `host` and `root_path` as needed
- Modify `terraform/main.tf` as needed, or create a `terraform/.auto.tfvars` file to override Terraform variables.
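As a starting point for the pipeline customization above, a bronze table in a Lakeflow (Delta Live Tables) pipeline looks roughly like this (a sketch; the table name, volume path, and parsing logic are illustrative assumptions rather than the repo's actual code):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="raw_logs_bronze",
    comment="Raw cluster logs ingested with Auto Loader",
)
def raw_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")
        .load("/Volumes/watchtower/default/raw_logs/")
        # Example customization: extract the log level so the
        # dashboard can filter on it directly.
        .withColumn("level", F.regexp_extract("value", r"\b(INFO|WARN|ERROR)\b", 1))
    )
```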