Andrew Friedman
afriedman412 [at] gmail [dot] com
WORK
ML Engineer
1-1-2022 to present
Sludge
Build and operate production AI/data systems that extract structured information from FEC campaign-finance filings, congressional stock-trading disclosures and municipal budget PDFs. The pipelines power ongoing newsroom queries and super-PAC industry-flow visualizations.
- Designed a RAG pipeline over thousands of municipal budget PDFs (ChromaDB + OpenAI embeddings) with metadata filtering and async producer-consumer ingest, plus crash-resistant subprocess isolation and SQLite-based job-state tracking.
- Owned the full lifecycle of fine-tuned extraction models (Mistral 7B with LoRA) — dataset curation, evaluation against held-out ground truth, and production deployment.
- Productionized real-time FEC ingestion as a containerized FastAPI service on GCP Cloud Run and Cloud SQL, with Cloud Scheduler triggering ingest jobs and multi-tier job sizing for heavy parallel workers.
- Implemented retry/backoff, idempotent dedup and checkpoint/resume to ingest 1.5M+ FEC records under API rate limits. Automated deploys to Cloud Run via GitHub Actions with OIDC.
A few stories that used data from these systems:
- The Members of Congress Who Profit From War
- Members of Congress Own Up to $93 Million in Fossil Fuel Stocks
- Revealed: how US senators invest in firms they are supposed to regulate
- Reps Questioning Megabank CEOs Own Stock in Their Companies
Data Scientist (Contract)
1-1-2023 to 6-1-2023
Center for Just Journalism
In partnership with the NYU Wagner School of Public Service, I worked with graduate students to investigate American newspapers' reliance on police sources when reporting on crime, and how that reliance affected coverage of both crime and police. While the students conducted an in-depth analysis of a representative 300-article sample, I used a programmatic approach to analyze the full 100,000-article data set.
KEY RESPONSIBILITIES:
- Development of a standalone Python package for quote identification, attribution and resolution
- Acquisition and processing of 100,000 articles
- Lexis/Nexis query optimization to minimize irrelevant or off-topic articles
- Topic modeling to verify the fidelity of the students' sub-sample
Data Lead
6-1-2022 to 6-2-2023
Google/Medill Data-Driven Reporting Project
Member of a team awarded a grant from the Google/Medill Data-Driven Reporting Project to study 30 years of detailed crime statistics obtained from the Baltimore Police Department.
Results of the study are being published as a multipart series in The Real News:
- Part 1: Baltimore's Crime Numbers Game
- Part 2: The Short History and Long Tail of Baltimore’s “Zero Tolerance” Policing
- Part 3: An Audit of Baltimore City's Data Integrity
- Part 4: An Evaluation of City Budget and Health Metrics
Data Scientist
11-1-2018 to 8-1-2022
Remarkable AI (fka Chatdesk)
Remarkable AI (previously Chatdesk) is a Series A customer service company backed by Silicon Valley investors including Menlo Ventures, Susa Ventures and Slow Ventures. Its customers include brands such as Grubhub, BarkBox, Thinx and OLAPLEX.
KEY RESPONSIBILITIES:
- Deployed a message classification model handling 1 million weekly messages with 99.5% accuracy
- Implemented Named Entity Recognition to increase flexibility of cleaning code
- Developed and maintained the code base that cleans and standardizes incoming messages for downstream processing, covering 100+ companies, 10+ languages and diverse sources (Zendesk, Salesforce, Intercom, Facebook, Instagram)
- Attended client meetings on technical integration and conducted data analysis to support the sales team
Instructional Associate
9-1-2018 to 5-1-2020
General Assembly
Trained students in data science methods, concepts and technologies, including bash, Python, data mining, supervised and unsupervised learning techniques, model building, forecasting, SQL, AWS and NLP.