Build a clean, automated pipeline that extracts and transforms production data from a home solar system with no public API. This repository focuses on robust extraction and transformation — designed to plug into a personal Airflow repository as a submodule. The data load/orchestration stage lives outside this repo by design.
Note on scope: This project is a submodule of a personal Airflow setup. The load stage is intentionally not included here.
- No official API: Data is hidden behind a login and a specific app view that enables an internal API.
- Practical engineering: Demonstrates resilient scraping, structured transformations, and testable Python without over-engineering.
- Personal analytics: Feeds a simple, consistent dataset for downstream visualization.
- Logs into the solar provider’s portal and fetches historical and current production data.
- Transforms raw JSON into tidy, analysis-ready DataFrames (hourly and daily summaries); see the sketch after this list.
- Writes control artifacts (e.g., missing date lists) to ensure continuity across runs.
- Orchestration and loading to databases/data lake. These are handled by the parent Airflow repository where this module is consumed.
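
The hourly and daily summaries mentioned above amount to standard pandas resampling. A minimal sketch, assuming illustrative column names (`timestamp`, `production_wh`) rather than the portal's actual schema:

```python
import pandas as pd

def aggregate(raw_records: list[dict]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Turn raw per-reading records into hourly and daily production tables."""
    df = pd.DataFrame(raw_records)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.set_index("timestamp").sort_index()

    # Sum production within each hour and within each calendar day.
    hourly = df["production_wh"].resample("1h").sum().to_frame()
    daily = df["production_wh"].resample("1D").sum().to_frame()
    return hourly, daily
```

The real `src/transforming.py` may apply additional cleaning and use different column names; this only shows the resampling idea.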
- Selenium: Reliable browser automation to reach the internal API endpoints (a login-flow sketch follows this list).
- Python + OOP: Clear separation of concerns and maintainability.
- Pytest: Function-level tests for critical components.
- Logging: Structured logs to aid debugging and observability.
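
To make the Selenium piece concrete, here is a minimal sketch of a login-then-fetch flow. The portal URL, element locators, and endpoint path are hypothetical placeholders, not the provider's real ones; the actual selectors and navigation live in `src/extraction.py`:

```python
import json
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

PORTAL_URL = "https://portal.example-solar.com"  # hypothetical base URL

def fetch_raw_json(target_date: str) -> dict:
    """Log in via browser automation, then read the internal API response for one date."""
    driver = webdriver.Chrome()
    try:
        driver.get(f"{PORTAL_URL}/login")
        driver.find_element(By.ID, "username").send_keys(os.environ["USERNAME"])
        driver.find_element(By.ID, "password").send_keys(os.environ["PASSWORD"])
        driver.find_element(By.ID, "login-button").click()

        # Wait until the app view that enables the internal API has loaded.
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "production-chart"))
        )

        # Request the internal endpoint from inside the authenticated session;
        # Chrome renders a raw JSON response inside a <pre> element.
        driver.get(f"{PORTAL_URL}/api/production?date={target_date}")
        return json.loads(driver.find_element(By.TAG_NAME, "pre").text)
    finally:
        driver.quit()
```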
- `src/missing_raw.py`: Identifies dates with missing data and writes them to a control file.
- `src/extraction.py`: Authenticates and pulls raw JSON from the portal (via Selenium-enabled flows).
- `src/transforming.py`: Parses JSON to pandas DataFrames and produces hourly and daily aggregates.
- `main.py`: Example runner wiring the steps together for local/debug usage.
Associated tests are in tests/ for extraction, transformation, and (where applicable) database-related helpers.
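
As an illustration of the function-level testing style, here is a hedged pytest example for the transformation step; the function name `transform_to_hourly` and the sample payload are invented for this sketch and may not match the repo's actual API:

```python
import pandas as pd

from src.transforming import transform_to_hourly  # hypothetical helper name

def test_transform_to_hourly_sums_readings_within_an_hour():
    # Two readings in the same hour should collapse into a single hourly row.
    raw = [
        {"timestamp": "2024-05-01T10:00:00", "production_wh": 120},
        {"timestamp": "2024-05-01T10:30:00", "production_wh": 150},
    ]
    hourly = transform_to_hourly(raw)

    assert isinstance(hourly, pd.DataFrame)
    assert len(hourly) == 1
    assert hourly["production_wh"].iloc[0] == 270
```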
- Identify gaps: Generate/update a list of missing dates (sketched after this walkthrough).
- Extract data: Log in, navigate to the correct view, and request per-date JSON.
- Transform data: Normalize, clean, and aggregate into hourly and daily tables.
Downstream loading/orchestration is performed by Airflow in the parent repository.
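
The gap-identification step boils down to comparing a calendar range against the raw files already on disk. A minimal sketch, assuming a hypothetical layout where each pull is saved as `data/raw/YYYY-MM-DD.json` and the control file is `data/missing_dates.txt` (names are illustrative, not the repo's actual ones):

```python
from datetime import date, timedelta
from pathlib import Path

def find_missing_dates(raw_dir: Path, start: date, end: date) -> list[date]:
    """Return every date in [start, end] with no corresponding raw JSON file."""
    existing = {p.stem for p in raw_dir.glob("*.json")}  # stems like "2024-05-01"
    missing, current = [], start
    while current <= end:
        if current.isoformat() not in existing:
            missing.append(current)
        current += timedelta(days=1)
    return missing

def write_control_file(missing: list[date], control_path: Path) -> None:
    """Persist the gap list so the next run knows which dates to re-extract."""
    control_path.write_text("\n".join(d.isoformat() for d in missing))

if __name__ == "__main__":
    gaps = find_missing_dates(Path("data/raw"), date(2021, 1, 1), date.today())
    write_control_file(gaps, Path("data/missing_dates.txt"))
```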
- Set credentials in `.env` (`USERNAME`, `PASSWORD`).
- Run `python main.py` to execute the identify → extract → transform sequence.
- Staging/output locations are configured inside `main.py` and via environment variables.
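
Credential loading is plain environment-variable access. A minimal sketch assuming the `python-dotenv` package is used to read `.env` (the actual `main.py` may wire this differently):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pulls USERNAME and PASSWORD from .env into the process environment
USERNAME = os.environ["USERNAME"]
PASSWORD = os.environ["PASSWORD"]
```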
Public dashboard (sample): https://public.tableau.com/app/profile/lucas8230/viz/HOMESOLARPANELPRODUCTION2021-2024/Painel1
- This is a personal, non-replicable setup tailored to a specific provider.
- Network behavior and UI flows may change; scraping logic is built to be adaptable but may require updates over time.