M.S. Data Science, University of Houston (4.0 GPA) · B.Tech CS · 3 published papers (IEEE, ResearchGate) · Databricks Data Engineer Associate certified · AWS Solutions Architect certified.
I build end-to-end data platforms — from Kafka ingestion to dimensional models to serving layers. The three flagship projects below use deliberately different architectures because their data behaves differently. Lambda for stateful order lifecycles. Kappa for append-only sensor streams. Multi-agent pipeline for clinical AI. Data Vault where schemas conflict. Star where dimensions are flat. Snowflake where hierarchies run deep.
Choosing the right pattern matters more than knowing every tool.
Currently Software Engineer at Url Systems Inc · Previously University of Houston.
Three production-grade platforms. Different architectures, different modeling, different cloud providers — each chosen for a reason.
|
Real-Time Dark Kitchen Intelligence Platform 50 kitchens · 10 cities · 3–5 brands per kitchen |
Wearable Health Analytics & Anomaly Detection 100K patients · 4 device types · 3 data sources |
Healthcare Reasoning & Analytics Platform Multi-agent clinical AI + data engineering |
GhostKitchen PulseTrack HERA v4
────────────── ────────────── ──────────────
Data nature Stateful (order lifecycle) Append-only (sensor readings) Multi-stage AI pipeline
Architecture Lambda (batch + streaming) Kappa (streaming only) 9-stage Command Center
Why? Orders get cancelled/refunded Heart rate of 72 bpm doesn't Clinical reasoning needs
→ need batch recomputation get "corrected" — one sequential stages: NER →
for exact revenue numbers pipeline handles everything agents → ML → RAG → eval
Silver model Data Vault 2.0 Normalized + metric explosion Star schema warehouse
Gold model Star Schema Snowflake Schema Hourly aggregations
Late data DLQ → nightly reconciliation Same pipeline, no DLQ DLQ in event streaming
Airflow role Conductor (orchestrates batch) Janitor (maintenance only) Custom 16-task ETL DAG
Cloud AWS (S3, Redshift, DynamoDB) Azure (Blob, Cosmos DB) AWS App Runner + ECR
Serving Metabase Apache Superset Grafana + Prometheus
Lambda Architecture — dual-path data flow
flowchart LR
subgraph SRC["Data Sources"]
O["Orders\n3 platforms\n500/min"]
S["IoT Sensors\n2,000/min"]
G["GPS Pings\n5,000/min"]
M["Menu CDC\n200/day"]
F["Feedback\nBatch CSV"]
end
subgraph SPEED["⚡ Streaming Path — Speed Layer"]
K[Apache Kafka]
SS[Spark Structured\nStreaming]
B1[Bronze\nDelta Lake]
S1[Silver\nData Vault]
G1["Gold\nStar Schema\n(approximate)"]
end
subgraph BATCH["📦 Batch Path — Truth Layer"]
AF[Airflow\nDaily 4AM]
SB[Spark Batch]
S2[Silver\nData Vault]
G2["Gold\nStar Schema\n(exact)"]
end
subgraph SERVE["Serving"]
MB[Metabase]
DDB[DynamoDB API]
RS[Redshift SQL]
end
O --> K
S --> K
G --> K
M --> K
K --> SS --> B1 --> S1 --> G1
F --> AF
B1 --> AF --> SB --> S2 --> G2
G2 -- "Reconciliation 6AM\nbatch overwrites streaming" --> G1
G1 --> MB
G1 --> DDB
G1 --> RS
Why Lambda here? Orders transition through states: placed → confirmed → preparing → ready → picked_up → delivered (or cancelled at any stage). A delivered order might be refunded 2 hours later, changing final revenue. The streaming path gives dashboards fast-but-approximate numbers. The daily batch path reprocesses 48 hours of Bronze to produce exact numbers and overwrites the streaming Gold. Delta Lake MERGE makes this atomic overwrite safe.
Data Vault 2.0 (Silver) — why not Star Schema here?
Three platforms define the same entities differently:
| Field | Uber Eats | DoorDash | OwnApp |
|---|---|---|---|
| Customer ID | customer_uid |
dasher_customer_id |
user_id |
| Order total | total_amount (float) |
order_value (float) |
amount_cents (integer!) |
| Timestamp | order_timestamp |
created_at |
timestamp |
Star Schema would force premature alignment. Data Vault solves this:
- Hubs unify identity — one
hub_customerrow per real person, regardless of platform - Satellites preserve source-specific attributes with full SCD2 history
- Links capture relationships (which order → which customer → which kitchen+brand)
Gold then transforms Data Vault → Star Schema for analyst-friendly 2-table joins.
Key tables:
HUBS LINKS SATELLITES
hub_customer link_order_customer sat_customer_profile (SCD2)
hub_order link_order_kitchen_brand sat_order_details
hub_kitchen sat_order_status (state machine)
hub_menu_item sat_menu_item_details (SCD2)
Identity resolution — 3 platforms → 1 customer
customer_identity_bridge
├── platform: uber_eats / doordash / own_app
├── platform_customer_id: ue_cust_44821
├── email_hash: MD5(normalized email)
├── phone_hash: MD5(phone) — nullable
├── match_confidence: 1.0 (exact) ... 0.6 (fuzzy)
└── match_method: exact_email / fuzzy_name_address / manual_override
Algorithm:
- Normalize emails (lowercase, trim)
- Group all platform IDs by
email_hash - Assign one
customer_keyper email group - Null emails → fuzzy match on
name + delivery_address - Store
match_method+confidencefor auditability
A customer ordering from Uber Eats AND DoorDash AND the direct app resolves to one row in dim_customer. Their cross-platform behavior is now visible: average order value per platform, platform loyalty, cannibalization analysis.
Dimensional model (Gold) — fact tables & grains
| Fact Table | Grain | Why this grain |
|---|---|---|
fact_order |
1 row per order (final state) | CEO needs: count, revenue, delivery time |
fact_order_state_history |
1 row per order × status change | Ops needs: bottleneck analysis per state |
fact_sensor_hourly |
1 row per kitchen × sensor × zone × hour | 1.3M raw events/day → hourly rollup for dashboards. Atomic stays in Silver for ML. |
fact_delivery_trip |
1 row per delivery | Aggregated from raw GPS pings. Pings too granular for dashboards. |
Dual-grain strategy: Atomic events in Silver for correctness + pre-aggregated in Gold for performance. Every fact table has a deliberate grain decision documented.
SCD types used: Type 0 (zones — never change) · Type 1 (kitchens — overwrite) · Type 2 (menu items, customers — full history)
Bridge tables: bridge_kitchen_brand (M:M — one kitchen runs multiple brands) · customer_identity_bridge (cross-platform resolution)
Kappa Architecture — single streaming engine
flowchart LR
subgraph SRC["Data Sources"]
W["Wearable Devices\n4 types"]
P["Pharmacy\nCDC Events"]
E["EHR\nFHIR Batches"]
end
subgraph KAPPA["🔄 Single Streaming Pipeline"]
K[Apache Kafka]
SS["Spark Structured\nStreaming"]
B[Bronze\nDelta Lake]
SL["Silver\nNormalized\nMetric Explosion"]
GL["Gold\nSnowflake Schema"]
end
subgraph MAINT["🔧 Airflow — Maintenance Only"]
IR["Identity Resolution\nDaily 2AM"]
DQ["Data Quality\nEvery 6h"]
CM["Compaction\nDaily"]
HA["HIPAA Audit\nWeekly"]
end
subgraph SERVE["Serving"]
SU[Superset]
CO[Cosmos DB API]
AZ[Azure Functions\nAlerts]
end
W --> K
P --> K
K --> SS --> B --> SL --> GL
E --> B
SL -.-> IR
SL -.-> DQ
GL -.-> CM
GL -.-> HA
GL --> SU
GL --> CO
GL --> AZ
Why Kappa here? A heart rate reading of 72 bpm at 2:34pm doesn't get "corrected" later. It's an immutable measurement. No need for Lambda's dual-pipeline complexity. One streaming engine handles real-time, late-arriving data (30% of events arrive hours late from batch sync), AND backfills (reset Kafka offsets, replay through same code). Same pipeline, same logic, one truth.
Airflow's role is completely different here. In GhostKitchen, Airflow is the conductor — it orchestrates the batch ETL path. In PulseTrack, Airflow is the janitor — it handles maintenance (identity resolution refresh, data quality sweeps, Delta compaction, HIPAA audits) while the streaming pipeline runs continuously on its own.
Metric explosion — the core Silver transformation
Bronze stores one row per device reading with nested metrics:
{"device_id": "SW-001", "metrics": {"heart_rate_bpm": 72, "spo2_pct": 98, "hrv_ms": 45}}Silver explodes this to one row per metric:
SW-001 | heart_rate_bpm | 72.0 | ✅ valid
SW-001 | spo2_pct | 98.0 | ✅ valid
SW-001 | hrv_ms | 45.0 | ✅ valid
Why? Anomaly detection, dashboards, and aggregations all operate on individual metrics. Nested JSON forces complex extraction on every query. Explosion is done once in Silver, benefitting every downstream consumer.
Quality flags (not dropped):
| Metric | Valid Range | Outside range → |
|---|---|---|
| heart_rate_bpm | 30–220 | is_valid = false (kept for anomaly detection) |
| spo2_pct | 70–100 | is_valid = false |
| blood_glucose_mgdl | 40–400 | is_valid = false |
| skin_temp_celsius | 30–42 | is_valid = false |
Invalid readings are never dropped — they're flagged and kept because out-of-range values may indicate genuine medical events that anomaly detection should evaluate.
Snowflake Schema (Gold) — why not Star here?
Healthcare dimensions are deeply hierarchical:
- ICD-10 codes: Chapter → Category → Specific Code (3 levels)
- Medications: Drug Class → Medication → Dosage → Patient Assignment
- Devices: Device Type → Firmware Version → Patient Reassignment
Star Schema would create a "kitchen sink" dim_patient with 50+ columns. Snowflake normalizes these hierarchies into separate dimension tables with foreign keys.
Fact tables:
| Table | Grain |
|---|---|
fact_vital_reading |
1 row per patient × metric × timestamp (atomic) |
fact_vital_daily_summary |
1 row per patient × metric × day (pre-aggregated) |
fact_activity_session |
1 row per exercise session |
fact_lab_result |
1 row per lab test |
fact_prescription_fill |
1 row per pharmacy dispense |
fact_anomaly_alert |
1 row per detected anomaly |
Dimension tables with SCD types:
| Dimension | SCD | Why |
|---|---|---|
dim_patient |
Type 2 | Lean — no conditions/meds crammed in |
dim_medication |
Type 2 | Hardest SCD2: start/stop/dosage change/restart |
dim_condition → dim_condition_category |
Type 2 → Type 0 | ICD-10 hierarchy normalized |
dim_device |
Type 2 | Firmware versions + patient reassignment |
dim_drug_class |
Type 0 | Classification doesn't change |
Identity resolution — 7 identifier types → 1 patient
The hardest engineering problem in PulseTrack. A single patient may appear as:
device_account_id → email → hospital_mrn → pharmacy_id → insurance_id → phone_hash → ssn_hash
Multi-hop graph matching: device_account links to email, email links to hospital_mrn, hospital_mrn links to pharmacy_id — all resolving to one unified patient_key via the patient_identity_bridge table (1,159 rows for 100K user dataset).
This is significantly more complex than GhostKitchen's 3-platform identity resolution because healthcare identifiers are fragmented across completely independent systems (hospitals, pharmacies, insurers, device manufacturers) with no shared login.
Anomaly detection — personal baselines, not population thresholds
A runner with resting HR 52 showing HR 85 is more concerning than a sedentary person at HR 85. Population-level thresholds miss this.
Per-patient rolling statistics (mean, σ for each metric over 30 days). New reading compared against personal baseline. Deviation > 2σ = anomaly alert.
- State: Keyed by
(patient_key, metric_name). Each key stores running stats + circular buffer. - Backed by: RocksDB state store in Spark Structured Streaming.
- TTL: 90 days for inactive patients.
- Output:
fact_anomaly_alertin Gold → Azure Functions → email/SMS notifications.
HIPAA compliance — built in, not bolted on
- De-identification: Gold uses
age_bracket(not exact age),city(not address), no names - Encryption: Delta Lake files encrypted at rest
- Access control:
data_engineer(all layers),clinical_analyst(Gold only, PII masked),researcher(Gold, fully de-identified) - Deletion: Delta Lake DELETE across all layers → verification sweep → audit log
- Retention: HIPAA requires 7-year retention. Patient deletion = de-identify PII but keep analytical record.
- Audit: Weekly
dag_hipaa_auditscans for PII leakage across all layers
Production-grade healthcare AI platform in two layers: AI Layer (multi-agent clinical reasoning, RAG, biomedical NER, risk prediction, FHIR R4) and Data Engineering Layer (event streaming, column-level lineage, data quality, star schema warehouse, ETL, CDC, data catalog). Both unified through a Command Center that runs all 9 stages as a single pipeline per patient encounter. Deployed on AWS App Runner via GitHub Actions CI/CD.
9-stage Command Center pipeline
flowchart TD
INPUT["Patient Vitals + Clinical Notes + FHIR Bundle"]
INPUT --> S1["Stage 1: NER Extraction\n33 meds · 28 conditions · 25 procedures"]
S1 --> S2["Stage 2: Knowledge Graph\nNetworkX DiGraph · 3 edge types\n20 treatment · 5 contraindication · 9 diagnostic"]
S1 --> S3
S2 --> S3["Stage 3: Multi-Agent Reasoning"]
subgraph AGENTS["Three Agents — Sequential"]
T["Triage Agent\nESI v4 · 6 vital thresholds"]
D["Diagnostic Agent\nICD-10 · 6 complaint categories"]
TR["Treatment Agent\n8 protocols · drug interactions\nallergy cross-reference"]
T --> D --> TR
end
S3 --> AGENTS
AGENTS --> S4["Stage 4: Risk Prediction\nRandom Forest + SHAP\n9 features · explainable"]
AGENTS --> S5["Stage 5: RAG Retrieval\nFAISS + MiniLM-L6-v2\n15 curated guidelines"]
S5 --> S6["Stage 6: Clinical Summarization\nT5 Transformer · beam search\nauto-fallback to pretrained"]
S6 --> S7["Stage 7: Safety Evaluation\n4-axis LLM-as-Judge\nfactual · hallucination · accuracy · safety"]
S7 -- "score < threshold" --> S6
S7 --> S8["Stage 8: FHIR R4 Export\nPatient · Observation · RiskAssessment\nLOINC-coded vitals"]
S4 --> S9
S8 --> S9["Stage 9: Data Engineering\n7 integrated systems"]
One POST /api/command-center runs all 9 stages per patient encounter. No manual orchestration. Stage 7 (Safety Evaluation) has a feedback loop — if the score is below threshold, it triggers re-summarization with RAG context injection.
Consensus score: 0.3 × risk + 0.4 × confidence + 0.3 × evidence_grade
Multi-agent clinical reasoning — why agents, not a monolithic LLM
Clinical reasoning is multi-step. A single LLM call can't provide the auditability, separation of concerns, and protocol adherence required in healthcare.
| Agent | Input | Output | Method |
|---|---|---|---|
| Triage (ESI v4) | Vitals + complaint | ESI level 1–5 | 6 vital thresholds × 4 levels + high-acuity keywords + resource estimation |
| Diagnostic (ICD-10) | Triage + NER entities | Differential diagnoses with probabilities | Evidence overlap − rule-out penalty + acuity boost + age factor |
| Treatment | Diagnosis + patient history | Treatment plan | 8 protocols by ICD-10 + drug interaction checking (5 pairs) + allergy cross-reference |
The orchestrator chains all three and records a full audit trail (JSONB in PostgreSQL). Every decision is traceable.
Key design decision: RAG over fine-tuning for medical knowledge. Guidelines change — RAG enables hot-swap without retraining, and citations provide provenance.
Data Engineering layer — 7 systems, zero external dependencies
Each system is ~200–500 lines of focused Python implementing production-faithful patterns:
| System | Lines | What it does |
|---|---|---|
| Event Streaming | 422 | Kafka-style with schema registry (5 schemas), MD5 partitioning, consumer groups, DLQ |
| Column-Level Lineage | 506 | DAG tracking 36+ nodes across 8 pipeline stages, impact_analysis(), PII flagging |
| Data Quality | 349 | 12 checks across 5 categories (schema, completeness, accuracy, consistency, freshness) |
| Star Schema Warehouse | 394 | fact_clinical_encounters (22 cols) + 4 dimensions + hourly aggregation |
| ETL Orchestrator | 437 | 16-task DAG with topological sort, retry, SLA monitoring, upstream failure propagation |
| CDC | 246 | Before/after snapshots, SHA-256 checksums, field-level diffs, event replay |
| Data Catalog | 499 | 12 datasets with PII tracking, freshness SLAs, relevance-scored search |
Why custom over Airflow/Kafka/dbt? Zero external dependencies. Production-faithful patterns that run with just pip install. Demonstrates understanding of the internals, not just tool configuration.
ETL DAG execution flow:
ingest → validate → extract_entities → build_kg
→ run_triage → run_diagnosis → run_treatment
→ predict_risk
run_diagnosis → retrieve_rag → generate_summary → evaluate_safety
↓
export_fhir ← [run_treatment, predict_risk, generate_summary, evaluate_safety]
├── load_warehouse → emit_cdc → update_catalog
└── track_lineage → update_catalog
Safety evaluation — LLM-as-Judge over ROUGE
ROUGE can't detect hallucinations or clinically dangerous errors. HERA uses a 4-axis evaluation:
| Axis | Weight | Method |
|---|---|---|
| Factual consistency | 0.30 | Keyword overlap + 7 contradiction pair detection |
| Hallucination | 0.25 | Entity precision/recall + fabricated claim detection |
| Medical accuracy | 0.25 | 157 valid terms + dangerous dosage checks (5 drugs) |
| Clinical safety | 0.20 | 3 dangerous regex patterns + allergy cross-check |
Overall score = weighted sum. If below threshold → feedback loop re-triggers summarization with RAG context.
Infrastructure — deployed, monitored, tested
Deployment pipeline:
Push to main → GitHub Actions CI (lint + 104 tests + Docker build) → ECR → AWS App Runner auto-deploys
Observability (Prometheus + Grafana):
- 17-panel Grafana dashboard (auto-provisioned): System Health, Clinical Risk & Triage, API Performance, Pipeline & Data Engineering
- Prometheus scrapes
/metricsevery 5s:hera_requests_total,hera_request_failures_total,hera_request_latency_seconds - Alert rule:
TooManyHighRiskfires whenhigh_risk_predictions > 3for 30s
PostgreSQL (5 tables): patient_predictions, summaries, clinical_reasoning_sessions, ner_extractions, evaluation_reports — all populated automatically per API call.
Middleware: API key auth + rate limiting (30/min anon, 60/min auth) + HIPAA audit trail with trace IDs.
Cost: AWS App Runner 1 vCPU, 2GB RAM, auto-scales to 0 → ~$5–7/mo.
Key design decisions
| Decision | Rationale |
|---|---|
| Multi-agent over monolithic LLM | Clinical reasoning is multi-step. Agents provide separation of concerns and auditability. |
| RAG over fine-tuning for knowledge | Medical guidelines change. RAG enables hot-swap without retraining. Citations provide provenance. |
| FHIR R4 for interoperability | Mandated for US healthcare data exchange (21st Century Cures Act). |
| Random Forest + SHAP over deep learning | 9 features, binary target. "SpO2 and MAP drove this High Risk" is more valuable than marginal accuracy from a black box. |
| Column-level over table-level lineage | Field-level provenance for HIPAA compliance and impact analysis. Table-level is too coarse for healthcare governance. |
| Custom DE systems over Airflow/Kafka/dbt | Demonstrates understanding of internals. Zero dependencies. Production-faithful patterns in ~300 lines each. |
| Unified Command Center over microservices | One POST runs all 9 stages. Atomic processing. Decompose into microservices at 100K+ patients/day. |
| Project | Stack | What it does |
|---|---|---|
| Real-Time Stock Pipeline | Kafka · Spark · Airflow · S3 · Redshift · Tableau | Streaming stock market data → cloud warehouse → analytics dashboards |
| Quantum Image Processing | Qiskit · Python · Jupyter | Quantum computing approaches to classical image processing problems |
| Financial News Sentiment | NLP · Python | Sentiment classification for financial news articles |
| Paper | Venue | Link |
|---|---|---|
| Classification of Parkinson's Disease in Brain MRI Images Using Deep Residual CNN | ResearchGate | Read → |
| Efficient Smart Micro-Scale Solar Power Management System for Rechargeable Nodes | IEEE | Read → |
| Estimation of Periodontal Bone Loss Using SVM and Random Forest | IEEE | Read → |
| Certification | Issuer |
|---|---|
| Databricks Data Engineer Associate | Databricks |
| AWS Certified Solutions Architect | Amazon Web Services |
| Tableau Desktop Specialist | Tableau / Salesforce |
| UiPath Automation Developer Associate | UiPath |
| Degree | Institution | GPA |
|---|---|---|
| M.S. Engineering Data Science | University of Houston (2023–2025) | 4.0 / 4.0 |
| B.Tech Computer Science | VR Siddhartha Engineering College |