surakifalenye/substack-publications-scraper

Substack Publications Scraper 📚

Extract comprehensive Substack publication data from keyword-based search results, including newsletter metadata, authors, subscriber metrics, and theme settings. This Substack publications scraper helps analysts, marketers, and creators quickly map the newsletter landscape and benchmark audience reach.


Telegram · WhatsApp · Gmail · Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for substack-publications-scraper, you've just found your team. Let's Chat. 👆👆

Introduction

Substack Publications Scraper lets you search Substack by keywords and collect structured information about matching newsletters. Instead of manually opening each publication, you get key details such as author profile, subscriber ranges, creation dates, and engagement-related settings in one dataset.

This project is ideal for content marketers, media analysts, newsletter founders, and data teams who need a reliable way to analyze Substack publications at scale, perform competitive research, or build internal dashboards around newsletter performance.

Newsletter analytics & discovery

  • Discover relevant Substack publications using one or more keyword queries.
  • Capture rich publication metadata, including tags, descriptions, and hero text.
  • Track subscriber counts and ranking details across multiple languages.
  • Analyze theme, layout, and feature flags to understand how top newsletters are configured.
  • Export clean, machine-readable data for use in BI tools, CRMs, and custom automations.

Features

| Feature | Description |
| --- | --- |
| Keyword-based publication search | Find Substack publications by one or more keywords and collect structured results automatically. |
| Detailed publication metadata | Capture names, descriptions, hero text, domains, status flags, and configuration options for each publication. |
| Author profiles & contributors | Extract author names, bios, handles, profile photos, and contributor roles to understand who runs each newsletter. |
| Subscriber & ranking insights | Collect ranking detail text, approximate free subscriber counts, and language-specific ranking descriptors. |
| Theme & appearance settings | Retrieve theme colors, homepage layout, hero behavior, and other UI-related configuration values. |
| Community & podcast flags | See whether community features, podcasts, and chat/notes content are enabled for each publication. |
| JSON dataset output | Store all results in structured JSON records suitable for analysis, enrichment, and downstream pipelines. |
| Configurable volume | Control how many publications to fetch per run using a simple maxItems input parameter. |

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| keyword | The search term that produced this publication result. |
| name | The public name of the newsletter/publication. |
| hero_text | Introductory tagline or hero text shown on the publication homepage. |
| description | Brief description or summary text of the publication (if available). |
| subdomain | Substack subdomain used by the publication. |
| base_url | Canonical web URL of the publication homepage. |
| hostname | Hostname used for the publication (useful for domain grouping). |
| language | Primary language code of the publication (e.g. en, de, es). |
| created_at | Timestamp indicating when the publication object was created. |
| first_post_date | Date/time of the first published post for this newsletter. |
| type | Type of publication (e.g. newsletter). |
| community_enabled | Boolean flag indicating whether community features are enabled. |
| has_community_content | Whether the publication currently has any community content. |
| has_podcast | Whether the publication has any podcast content configured. |
| has_free_podcast | Whether a free podcast feed is available. |
| has_subscriber_only_podcast | Whether subscriber-only podcast episodes exist. |
| author_id | Internal numeric identifier for the main author. |
| author_name | Full display name of the primary author. |
| author_handle | Public handle/slug of the author on the platform. |
| author_photo_url | Direct URL to the author profile image. |
| author_bio | Short biography text describing the author. |
| contributors | Array of contributor objects (name, handle, role, bio, photo URL, owner flag, user ID). |
| freeSubscriberCount | Approximate count of free subscribers as a string. |
| freeSubscriberCountOrderOfMagnitude | Abbreviated representation of subscriber count (e.g. 8.2K+). |
| rankingDetail | Human-readable ranking sentence (e.g. launch timing). |
| rankingDetailFreeIncluded | Human-readable description of audience size (e.g. "Thousands of subscribers"). |
| rankingDetailByLanguage | Object containing ranking details translated into multiple languages. |
| theme | Nested object describing colors, layout, fonts, and other theme configuration options. |
| podcastPalette | Color palette data derived from podcast artwork, grouped by tonal category. |
| payments_state | Current payments state (e.g. enabled, disabled). |
| invite_only | Whether the publication is invite-only. |
| explicit | Whether the content is marked as explicit. |
| homepage_type | Configuration for how the homepage is structured (e.g. magaziney). |
| logo_url | URL to the main logo image of the publication. |
| cover_photo_url | URL to the publication's cover or hero image. |
| rss_feed_url | RSS feed URL, if exposed. |
| rss_website_url | Website URL associated with RSS content, if available. |
| post_reaction_faces_enabled | Whether reaction emojis/faces are enabled for posts. |
| moderation_enabled | Whether moderation features are turned on. |
| navigationBarItems | Array of custom navigation items configured for the publication. |
| sections | Array of sections or content groupings configured for the newsletter. |
| tier | Internal tier/level of the publication. |
| scrapedAt | Timestamp when this publication record was last scraped. |

Example Output

Example:

[
  {
    "keyword": "sale",
    "name": "From Somewhere with Anna Sale",
    "subdomain": "annasale",
    "base_url": "https://annasale.substack.com",
    "language": "en",
    "created_at": "2023-11-07T19:46:21.152Z",
    "first_post_date": "2023-12-28T15:19:46.188Z",
    "community_enabled": true,
    "has_podcast": false,
    "author_id": 116594,
    "author_name": "Anna Sale",
    "author_handle": "annasale",
    "author_photo_url": "https://substackcdn.com/image/fetch/.../b8a58b1a.jpeg",
    "author_bio": "I'm an interviewer and writer. My podcast is Death, Sex & Money from Slate.",
    "freeSubscriberCount": "8,000",
    "freeSubscriberCountOrderOfMagnitude": "8.2K+",
    "rankingDetail": "Launched a year ago",
    "rankingDetailFreeIncluded": "Thousands of subscribers",
    "theme": {
      "background_pop_color": "#16a34a",
      "web_bg_color": "#ffffff",
      "home_posts": "grid"
    },
    "contributors": [
      {
        "name": "Anna Sale",
        "handle": "annasale",
        "role": "admin",
        "owner": true
      }
    ],
    "scrapedAt": "2025-02-10T05:37:31.758Z"
  }
]
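
Because freeSubscriberCount is delivered as a formatted string rather than a number, numeric analysis needs a conversion step first. The snippet below is a minimal sketch, assuming the comma-grouped format shown in the example record; the helper name parse_subscriber_count is hypothetical and not part of the scraper itself.

```python
from typing import Optional


def parse_subscriber_count(raw: Optional[str]) -> Optional[int]:
    """Convert a string such as "8,000" to an int.

    Returns None when the value is missing or is not a plain
    comma-grouped number (for example the abbreviated "8.2K+").
    """
    if not raw:
        return None
    digits = raw.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


# Record trimmed down from the example output above.
record = {"name": "From Somewhere with Anna Sale", "freeSubscriberCount": "8,000"}
print(parse_subscriber_count(record.get("freeSubscriberCount")))  # 8000
```

Returning None instead of raising keeps a batch job running when a record has an unexpected format; whether that is the right trade-off depends on your pipeline.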

Directory Structure Tree

Example:

Substack Publications Scraper 📚/
├── src/
│   ├── main.py
│   ├── client/
│   │   ├── substack_api.py
│   │   └── http_session.py
│   ├── extractors/
│   │   ├── publications_parser.py
│   │   └── ranking_utils.py
│   ├── pipelines/
│   │   ├── normalizer.py
│   │   └── deduplicator.py
│   ├── storage/
│   │   ├── dataset_writer.py
│   │   └── json_exporter.py
│   └── config/
│       └── settings.example.json
├── data/
│   ├── input.example.json
│   └── sample-output.json
├── tests/
│   ├── test_extractor.py
│   ├── test_normalizer.py
│   └── test_end_to_end.py
├── docs/
│   └── schema.md
├── requirements.txt
├── pyproject.toml
└── README.md

Use Cases

  • Content marketers use it to map out relevant Substack newsletters in their niche, so they can identify outreach targets and partnership opportunities based on audience size and positioning.
  • Newsletter founders use it to benchmark competing publications, so they can refine their own positioning, pricing, and content strategy.
  • Media analysts use it to track newsletter growth and category trends, so they can build internal reports and dashboards for stakeholders.
  • Investor and research teams use it to monitor the independent media ecosystem, so they can spot emerging creators and high-signal niches early.
  • Product and platform teams use it to enrich internal knowledge graphs with structured publication data, so they can power recommendation engines and discovery features.

FAQs

Q: What inputs do I need to provide to run this scraper? You primarily need to supply an array of keywords that describe the publications you want to find, plus an optional maxItems parameter to control how many publication records are collected. For most workflows, a handful of well-chosen keywords is enough to build a rich dataset.
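
As an illustration of the input described in this answer, here is a minimal sketch of an input object and a sanity check for it. Only the keywords and maxItems parameter names come from this README; the sample values and the validate_input helper are assumptions made for the example.

```python
import json

# Hypothetical input: the two parameter names are from this README,
# the values are invented for illustration.
run_input = {
    "keywords": ["climate", "personal finance"],
    "maxItems": 50,
}


def validate_input(data: dict) -> dict:
    """Check that keywords is a non-empty list of strings and maxItems, if set, is a positive int."""
    keywords = data.get("keywords")
    if (
        not isinstance(keywords, list)
        or not keywords
        or not all(isinstance(k, str) and k.strip() for k in keywords)
    ):
        raise ValueError("keywords must be a non-empty list of non-empty strings")
    max_items = data.get("maxItems")
    if max_items is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
            raise ValueError("maxItems must be a positive integer when provided")
    return data


print(json.dumps(validate_input(run_input), indent=2))
```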

Q: What format is the output provided in? The scraper outputs structured JSON records similar to the example above. You can easily convert this JSON into CSV, Excel, database tables, or feed it directly into analytics pipelines and business intelligence tools.
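
One way to do the JSON-to-CSV conversion mentioned here is with Python's standard library. This is a sketch under stated assumptions: the column selection in FLAT_FIELDS and the records_to_csv helper are illustrative choices, and nested fields such as theme or contributors are left out because they do not map onto a single CSV cell.

```python
import csv
import io
import json

# Flat, scalar fields chosen for illustration; extend as needed.
FLAT_FIELDS = (
    "keyword", "name", "subdomain", "base_url",
    "language", "author_name", "freeSubscriberCount",
)


def records_to_csv(records, fields=FLAT_FIELDS) -> str:
    """Render a list of publication records as CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; nested fields are ignored.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


sample = json.loads(
    '[{"keyword": "sale", "name": "From Somewhere with Anna Sale",'
    ' "subdomain": "annasale", "theme": {"home_posts": "grid"}}]'
)
print(records_to_csv(sample))
```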

Q: Does this scraper handle multilingual publications? Yes. Each record includes a language field and may also contain rankingDetailByLanguage, which provides localized ranking and subscriber descriptions in several languages, making it easier to analyze non-English markets.

Q: Can I use this for large-scale market research? Absolutely. By combining multiple keyword queries and adjusting maxItems, you can build sizable datasets covering many different categories and niches. Just make sure to respect the target platform's terms of use and applicable data policies.
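
When several keyword queries are combined, the same publication can appear once per matching keyword. The sketch below shows one possible way to merge such duplicates, assuming subdomain uniquely identifies a publication. The deduplicate function and the added keywords list are hypothetical and do not describe the project's own deduplicator.

```python
from typing import Dict, Iterable, List


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Keep the first record per subdomain and collect every keyword that matched it."""
    seen: Dict[str, dict] = {}
    for record in records:
        key = record.get("subdomain")
        if key is None:
            continue  # assumption: records without a subdomain have no stable key and are skipped
        keyword = record.get("keyword")
        if key not in seen:
            merged = dict(record)  # copy so the caller's record is not mutated
            merged["keywords"] = [keyword]
            seen[key] = merged
        elif keyword not in seen[key]["keywords"]:
            seen[key]["keywords"].append(keyword)
    return list(seen.values())


results = [
    {"keyword": "sale", "subdomain": "annasale"},
    {"keyword": "podcast", "subdomain": "annasale"},
]
print(deduplicate(results)[0]["keywords"])  # ['sale', 'podcast']
```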


Performance Benchmarks and Results

Primary Metric: On typical connections, the scraper can process dozens of Substack publications per minute while collecting full metadata, including nested theme and ranking information.

Reliability Metric: In test runs across varied keyword sets, over 95% of discovered publications yielded complete and valid JSON records without missing critical fields like name, subdomain, or author_name.
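
To check a comparable completeness figure on your own dataset, a sketch along the following lines could be used. It treats a record as complete when name, subdomain, and author_name are all present and non-empty; the helper names is_complete and completeness_rate are hypothetical, and this is not necessarily how the figure above was measured.

```python
from typing import Iterable, Sequence

# Critical fields named in the reliability metric above.
CRITICAL_FIELDS = ("name", "subdomain", "author_name")


def is_complete(record: dict, required: Sequence[str] = CRITICAL_FIELDS) -> bool:
    """Return True when every required field is present and not empty."""
    return all(record.get(field) not in (None, "") for field in required)


def completeness_rate(records: Iterable[dict]) -> float:
    """Fraction of records that pass is_complete; 0.0 for an empty dataset."""
    items = list(records)
    if not items:
        return 0.0
    return sum(is_complete(item) for item in items) / len(items)


dataset = [
    {"name": "From Somewhere with Anna Sale", "subdomain": "annasale", "author_name": "Anna Sale"},
    {"name": "Untitled", "subdomain": "", "author_name": None},
]
print(completeness_rate(dataset))  # 0.5
```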

Efficiency Metric: The processing pipeline is designed to normalize and write records incrementally, allowing it to handle thousands of publications with modest CPU and memory usage on standard cloud instances.
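
Incremental writing can be pictured as follows. This is a sketch only: it uses the JSON Lines layout (one JSON object per line) as a stand-in technique, which this README does not confirm as the project's storage format, and the write_incrementally function is hypothetical.

```python
import io
import json
from typing import Iterable, TextIO


def write_incrementally(records: Iterable[dict], stream: TextIO) -> int:
    """Write each record as one JSON line as soon as it arrives; return the number written."""
    count = 0
    for record in records:
        # Only the current record is held in memory, so usage stays
        # flat no matter how many records the iterable yields.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def fake_results():
    """Stand-in generator; a real run would yield normalized records instead."""
    yield {"name": "From Somewhere with Anna Sale", "subdomain": "annasale"}
    yield {"name": "Another Newsletter", "subdomain": "another"}


buffer = io.StringIO()  # swap for open("output.jsonl", "w", encoding="utf-8") to write a file
print(write_incrementally(fake_results(), buffer))  # 2
```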

Quality Metric: Thanks to detailed metadata extraction (author info, subscriber ranges, language-specific ranking details, and theme configuration), the resulting dataset offers high completeness and precision for newsletter analytics, market research, and competitive intelligence workflows.


Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery. Bitbash nailed it."

Syed
Digital Strategist
★★★★