This repository contains a Jupyter notebook for extracting structured zoning and development-standard data from municipal zoning code PDFs using the Gemini API batch workflow.
This notebook was developed as part of the Future Land Use (FLU) work at the Puget Sound Regional Council (PSRC). PSRC provides regional data and forecasting resources that local planners and decision-makers use to understand how the region may grow and change over time. This zoning extraction workflow supports that broader land use context by helping convert municipal zoning code PDFs into structured fields that can be reviewed, cleaned, and used in planning or modeling workflows.
Gemini_API_Batch.ipynb runs a batch job against the PDFs stored in the Cities folder. The notebook:
- Reads the Gemini API key from a local key.txt file.
- Uploads each PDF in Cities/*.pdf to Gemini.
- Sends a zoning extraction prompt to models/gemini-2.5-pro.
- Polls the Gemini batch job until it finishes.
- Parses JSON responses into a dataframe.
- Saves the results as:
  - all_zoning_data_batch.csv
  - all_zoning_data_batch.json
  - all_zoning_batch_raw.jsonl
The extracted fields include jurisdiction, zone name, density, FAR, height, lot coverage, bonus availability, use flags, rural flags, and ADU notes.
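The parsing and saving steps can be sketched roughly as follows. This is a minimal illustration that uses only the standard library instead of the pandas dataframe the notebook builds. The raw line layout (a `file` key and a `response_text` key holding the model's JSON string) and the field names in the usage example are assumptions for illustration, not the notebook's actual schema.

```python
import csv
import io
import json

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes wrap JSON in


def parse_model_json(text):
    """Parse a model reply as JSON, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then anything from the closing fence on.
        _, _, cleaned = cleaned.partition("\n")
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


def records_from_jsonl(lines):
    """Flatten raw JSONL lines into a list of zone records.

    Each line is assumed to look like {"file": ..., "response_text": ...},
    where response_text is a JSON object or a JSON list of objects.
    """
    records = []
    for line in lines:
        if not line.strip():
            continue
        raw = json.loads(line)
        try:
            parsed = parse_model_json(raw["response_text"])
        except (json.JSONDecodeError, KeyError):
            # Skip replies that cannot be parsed; they remain in the raw file.
            continue
        zones = parsed if isinstance(parsed, list) else [parsed]
        for zone in zones:
            if isinstance(zone, dict):
                records.append({"source_file": raw.get("file"), **zone})
    return records


def to_csv(records):
    """Return CSV text with one column per field seen in any record."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Skipping unparseable replies instead of raising keeps one malformed response from discarding the rest of a run; the raw JSONL file remains the place to inspect what the model actually returned.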
This notebook uses Gemini batch processing. Batch processing is asynchronous: instead of returning each response immediately, the code submits many PDF extraction requests together, receives a batch job ID, polls for completion, and then downloads and parses the completed responses.
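The polling step of that submit, poll, download cycle can be sketched as below. This is an illustrative loop, not the notebook's code: `get_state` is a hypothetical callable that returns the job's state name, and the terminal state names and the 60-second interval are assumptions to check against the SDK version in use.

```python
import time

# State names assumed to mark a finished job; verify against your SDK.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
}


def wait_for_batch(get_state, poll_seconds=60, timeout_seconds=24 * 60 * 60,
                   sleep=time.sleep):
    """Call get_state() until it returns a terminal state or the timeout passes.

    get_state is a hypothetical zero-argument callable that returns the
    job's state name as a string. sleep is injectable so the loop can be
    exercised without real waiting.
    """
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch job still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the google-genai client, `get_state` would presumably be a thin wrapper such as `lambda: client.batches.get(name=job_name).state.name`, where `client` and `job_name` come from the submission step; treat that wrapper as an assumption and confirm it against the SDK documentation.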
The main benefit is cost. Gemini Batch API pricing is intended for high-volume, non-real-time work and provides a 50% discount compared with standard synchronous calls. The tradeoff is timing: batch jobs may take up to 24 hours to complete, although they may finish sooner.
Refer to the official Gemini Batch API and Gemini pricing documentation for current pricing details.
In a nutshell: batch processing is cheaper, but it is asynchronous and can take up to 24 hours.
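As a quick worked example of the discount, the sketch below halves a synchronous cost estimate. The dollar amount is hypothetical, and the 0.5 default simply mirrors the 50% figure quoted above; check it against current pricing before relying on it.

```python
def estimated_batch_cost(standard_cost, discount=0.5):
    """Return the batch cost given the equivalent synchronous cost.

    discount is the fraction taken off the standard price (assumed 0.5 here).
    """
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return standard_cost * (1 - discount)


# Hypothetical run that would cost 12.00 with synchronous calls.
print(estimated_batch_cost(12.00))  # prints 6.0
```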
The code can easily be restructured to make synchronous, one-by-one calls instead. In that version, each PDF would be uploaded or attached, sent to the model with a regular generate_content request, parsed immediately, and appended to the output files before moving on to the next PDF. That approach is simpler to debug and returns a response for each file immediately, but it does not receive the batch-processing discount.
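A synchronous variant might look like the sketch below. It is an illustration under assumptions, not the notebook's code: `run_one_by_one`, the `extract` callable, and the raw JSONL layout are hypothetical names, and `extract_with_gemini` assumes a google-genai client created elsewhere.

```python
import json
from pathlib import Path


def extract_with_gemini(client, pdf_path, prompt, model="gemini-2.5-pro"):
    """Send one PDF plus the prompt in a regular (non-batch) request.

    client is assumed to be a google-genai Client built elsewhere, for
    example genai.Client(api_key=...).
    """
    uploaded = client.files.upload(file=str(pdf_path))
    response = client.models.generate_content(model=model, contents=[uploaded, prompt])
    return response.text


def run_one_by_one(pdf_folder, extract, raw_out_path):
    """Process PDFs one at a time, appending each reply to a JSONL file.

    extract is any callable that takes a PDF path and returns the reply text.
    """
    processed = []
    with open(raw_out_path, "a", encoding="utf-8") as raw_out:
        for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
            try:
                reply = extract(pdf_path)
            except Exception as exc:
                # Record the failure and continue so one bad PDF does not stop the run.
                reply = None
                print(f"{pdf_path.name}: failed ({exc})")
            raw_out.write(json.dumps({"file": pdf_path.name, "response_text": reply}) + "\n")
            raw_out.flush()  # keep finished results on disk if a later call fails
            processed.append(pdf_path.name)
    return processed
```

A call could then look like `run_one_by_one("Cities", lambda p: extract_with_gemini(client, p, prompt), "raw.jsonl")`, where `client`, `prompt`, and the output filename are placeholders. Passing the extractor in as a callable keeps the loop easy to exercise with a stub before spending API calls.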
- A Gemini API key.
- A local key.txt file containing only the Gemini API key.
- Zoning PDFs placed in the Cities folder.
- Python packages:
  - google-genai
  - pandas
  - ipython
Example local setup:
pip install google-genai pandas ipython
Then create key.txt locally:
YOUR_GEMINI_API_KEY
Do not commit key.txt to GitHub.
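Reading the key file can be sketched as follows. The function name and the validation rules are illustrative assumptions; the notebook may read the file differently.

```python
from pathlib import Path


def read_api_key(path="key.txt"):
    """Return the API key stored in a text file that contains only the key."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"{key_file} not found; create it with your Gemini API key")
    # utf-8-sig also accepts files saved with a byte-order mark.
    key = key_file.read_text(encoding="utf-8-sig").strip()
    if not key or any(ch.isspace() for ch in key):
        raise ValueError(f"{key_file} should contain only the API key")
    return key
```

One common way to keep the file out of commits is to add a `key.txt` line to the repository's .gitignore.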
Full city codes can be very large, and the API may not parse them reliably in one request. When possible, upload only the zoning section or other relevant sections of the municipal code rather than the entire code. If the document is still too large, trim the source material to the sections the extraction needs: zoning, land use, density, FAR, height, lot coverage, use tables, overlays, bonuses, and ADUs.
The notebook expects those relevant zoning PDFs to be available in the local Cities folder.
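Before submitting a job, it may help to flag PDFs that look too large and are candidates for trimming. The sketch below is one way to do that; the function name and the 20 MB default are hypothetical review thresholds, not documented API limits.

```python
from pathlib import Path


def oversized_pdfs(folder="Cities", max_mb=20):
    """Return (name, size_mb) pairs for PDFs larger than max_mb.

    max_mb is a hypothetical review threshold, not an API limit.
    """
    limit_bytes = max_mb * 1024 * 1024
    flagged = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        size = pdf_path.stat().st_size
        if size > limit_bytes:
            flagged.append((pdf_path.name, round(size / (1024 * 1024), 1)))
    return flagged
```

File size is only a rough proxy for how well a document will parse; page count and scan quality also matter, so treat the flagged list as a prompt for manual review.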
Developed by Mohammad Mehdi Oshanreh
Data Intern at PSRC