The aim of this repository is twofold: to provide tools for heavy data crunching (deep statistical analyses, machine learning methods refactored into DBT Jinja SQL, etc.) and to collect big data best practices with DBT (cleaners that can be triggered to keep your datasets unpolluted, metadata crawlers for BigQuery, etc.).
Currently focused on GCP work with BigQuery. Support for the mission and PRs are also welcome.
Maybe one day this can be turned into an installable DBT package.
Data processing macros will be developed using dummy CSVs as DBT seeds, then run against massive columns. Processed row counts and computing times will be added to the documentation.
Currently working with Python 3.11.9. DBT/SQL libraries are listed in requirements.txt.
- Macros for data processing will be tested using CSVs as seeds to create both the input and the expected output.
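As an illustration of this seed-based pattern, a singular test could diff the macro's output against an expected seed (the file path, seed names, and the `clean_text` macro below are hypothetical, chosen only to show the shape of such a test):

```sql
-- tests/test_clean_text.sql (hypothetical): applies the macro under test
-- to the input seed and diffs the result against the expected seed.
-- The test passes when this query returns zero rows.
select {{ clean_text('raw_value') }} as cleaned
from {{ ref('clean_text_input') }}
except distinct
select cleaned
from {{ ref('clean_text_expected') }}
```

`except distinct` is BigQuery syntax, which matches the repository's current GCP focus.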
Description:
Helps keep the BigQuery environment clean and organized. Automatically removes redundant objects in BigQuery (tables that are no longer needed, old versions of renamed tables that still exist, etc.)
Path:
macros/utils/bq_cleaner.sql
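To give an idea of how such a cleaner can work, here is an illustrative sketch (this is NOT the actual contents of `bq_cleaner.sql`; the macro name, signature, and orphan-detection logic are all assumptions) that drops tables in a dataset that no longer correspond to any model in the DBT graph:

```sql
-- Illustrative sketch only, not the repository's actual implementation.
-- Drops tables in a BigQuery dataset that do not match any model name
-- in the current DBT graph (macro name and logic are assumptions).
{% macro drop_orphan_tables(dataset) %}
    {% if execute %}
        {% set model_names = graph.nodes.values() | map(attribute='name') | list %}
        {% set tables_query %}
            select table_name
            from `{{ target.project }}.{{ dataset }}.INFORMATION_SCHEMA.TABLES`
        {% endset %}
        {% for row in run_query(tables_query).rows %}
            {% if row['table_name'] not in model_names %}
                {# table exists in BigQuery but not in the project graph #}
                {% do run_query("drop table if exists `" ~ target.project ~ "." ~ dataset ~ "." ~ row['table_name'] ~ "`") %}
            {% endif %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```

A macro like this would typically be invoked as an operation, e.g. `dbt run-operation drop_orphan_tables --args '{dataset: my_dataset}'`.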
...
...
⚒️ In progress
String Occurrence Count
📋 TODO
If there is a specific functionality that you would like to see covered with DBT, contact me.
Support and PRs are also welcome.
TF-IDF
Min-max Scaler
Z-score Scaler
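As a sketch of what these scaler macros might look like once implemented (the macro name and signature below are assumptions, not a committed interface), a z-score scaler standardizes a column to mean 0 and standard deviation 1:

```sql
-- Hypothetical z-score scaler: (x - mean) / stddev over the whole relation.
-- safe_divide (BigQuery) returns NULL instead of erroring when stddev is 0.
{% macro z_score_scale(column) %}
    safe_divide(
        {{ column }} - avg({{ column }}) over (),
        stddev({{ column }}) over ()
    )
{% endmacro %}
```

It would be used inline in a model, e.g. `select {{ z_score_scale('price') }} as price_z from {{ ref('prices') }}`.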