Extract comprehensive Substack publication data from keyword-based search results, including newsletter metadata, authors, subscriber metrics, and theme settings. This Substack publications scraper helps analysts, marketers, and creators quickly map the newsletter landscape and benchmark audience reach.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for substack-publications-scraper, you've just found your team. Let's chat.
Substack Publications Scraper lets you search Substack by keywords and collect structured information about matching newsletters. Instead of manually opening each publication, you get key details such as author profile, subscriber ranges, creation dates, and engagement-related settings in one dataset.
This project is ideal for content marketers, media analysts, newsletter founders, and data teams who need a reliable way to analyze Substack publications at scale, perform competitive research, or build internal dashboards around newsletter performance.
- Discover relevant Substack publications using one or more keyword queries.
- Capture rich publication metadata, including tags, descriptions, and hero text.
- Track subscriber counts and ranking details across multiple languages.
- Analyze theme, layout, and feature flags to understand how top newsletters are configured.
- Export clean, machine-readable data for use in BI tools, CRMs, and custom automations.
| Feature | Description |
|---|---|
| Keyword-based publication search | Find Substack publications by one or more keywords and collect structured results automatically. |
| Detailed publication metadata | Capture names, descriptions, hero text, domains, status flags, and configuration options for each publication. |
| Author profiles & contributors | Extract author names, bios, handles, profile photos, and contributor roles to understand who runs each newsletter. |
| Subscriber & ranking insights | Collect ranking detail text, approximate free subscriber counts, and language-specific ranking descriptors. |
| Theme & appearance settings | Retrieve theme colors, homepage layout, hero behavior, and other UI-related configuration values. |
| Community & podcast flags | See whether community features, podcasts, and chat/notes content are enabled for each publication. |
| JSON dataset output | Store all results in structured JSON records suitable for analysis, enrichment, and downstream pipelines. |
| Configurable volume | Control how many publications to fetch per run using a simple maxItems input parameter. |
| Field Name | Field Description |
|---|---|
| keyword | The search term that produced this publication result. |
| name | The public name of the newsletter/publication. |
| hero_text | Introductory tagline or hero text shown on the publication homepage. |
| description | Brief description or summary text of the publication (if available). |
| subdomain | Substack subdomain used by the publication. |
| base_url | Canonical web URL of the publication homepage. |
| hostname | Hostname used for the publication (useful for domain grouping). |
| language | Primary language code of the publication (e.g. en, de, es). |
| created_at | Timestamp indicating when the publication object was created. |
| first_post_date | Date/time of the first published post for this newsletter. |
| type | Type of publication (e.g. newsletter). |
| community_enabled | Boolean flag indicating whether community features are enabled. |
| has_community_content | Whether the publication currently has any community content. |
| has_podcast | Whether the publication has any podcast content configured. |
| has_free_podcast | Whether a free podcast feed is available. |
| has_subscriber_only_podcast | Whether subscriber-only podcast episodes exist. |
| author_id | Internal numeric identifier for the main author. |
| author_name | Full display name of the primary author. |
| author_handle | Public handle/slug of the author on the platform. |
| author_photo_url | Direct URL to the author profile image. |
| author_bio | Short biography text describing the author. |
| contributors | Array of contributor objects (name, handle, role, bio, photo URL, owner flag, user ID). |
| freeSubscriberCount | Approximate count of free subscribers as a string. |
| freeSubscriberCountOrderOfMagnitude | Abbreviated representation of subscriber count (e.g. 8.2K+). |
| rankingDetail | Human-readable ranking sentence, typically about launch timing (e.g. "Launched a year ago"). |
| rankingDetailFreeIncluded | Human-readable description of audience size (e.g. "Thousands of subscribers"). |
| rankingDetailByLanguage | Object containing ranking details translated into multiple languages. |
| theme | Nested object describing colors, layout, fonts, and other theme configuration options. |
| podcastPalette | Color palette data derived from podcast artwork, grouped by tonal category. |
| payments_state | Current payments state (e.g. enabled, disabled). |
| invite_only | Whether the publication is invite-only. |
| explicit | Whether the content is marked as explicit. |
| homepage_type | Configuration for how the homepage is structured (e.g. magaziney). |
| logo_url | URL to the main logo image of the publication. |
| cover_photo_url | URL to the publication's cover or hero image. |
| rss_feed_url | RSS feed URL, if exposed. |
| rss_website_url | Website URL associated with RSS content, if available. |
| post_reaction_faces_enabled | Whether reaction emojis/faces are enabled for posts. |
| moderation_enabled | Whether moderation features are turned on. |
| navigationBarItems | Array of custom navigation items configured for the publication. |
| sections | Array of sections or content groupings configured for the newsletter. |
| tier | Internal tier/level of the publication. |
| scrapedAt | Timestamp when this publication record was last scraped. |
Example:
[
  {
    "keyword": "sale",
    "name": "From Somewhere with Anna Sale",
    "subdomain": "annasale",
    "base_url": "https://annasale.substack.com",
    "language": "en",
    "created_at": "2023-11-07T19:46:21.152Z",
    "first_post_date": "2023-12-28T15:19:46.188Z",
    "community_enabled": true,
    "has_podcast": false,
    "author_id": 116594,
    "author_name": "Anna Sale",
    "author_handle": "annasale",
    "author_photo_url": "https://substackcdn.com/image/fetch/.../b8a58b1a.jpeg",
    "author_bio": "I'm an interviewer and writer. My podcast is Death, Sex & Money from Slate.",
    "freeSubscriberCount": "8,000",
    "freeSubscriberCountOrderOfMagnitude": "8.2K+",
    "rankingDetail": "Launched a year ago",
    "rankingDetailFreeIncluded": "Thousands of subscribers",
    "theme": {
      "background_pop_color": "#16a34a",
      "web_bg_color": "#ffffff",
      "home_posts": "grid"
    },
    "contributors": [
      {
        "name": "Anna Sale",
        "handle": "annasale",
        "role": "admin",
        "owner": true
      }
    ],
    "scrapedAt": "2025-02-10T05:37:31.758Z"
  }
]
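As a minimal sketch of consuming a record like the one above, the snippet below converts two string-valued fields into native Python types. The helper names `parse_subscriber_count` and `parse_timestamp` are hypothetical and not part of the scraper. The sketch assumes `freeSubscriberCount` is a comma-grouped digit string and that timestamps are ISO 8601 with a trailing `Z`, as in the sample record.

```python
from datetime import datetime
from typing import Optional


def parse_subscriber_count(raw: Optional[str]) -> Optional[int]:
    """Convert a string such as "8,000" to an int.

    Returns None when the value is missing or not purely numeric
    (for example an abbreviated form like "8.2K+").
    """
    if not raw:
        return None
    digits = raw.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def parse_timestamp(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime."""
    # Replacing 'Z' with an explicit offset keeps this working on older Python 3 versions.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


# Values copied from the sample record.
record = {
    "freeSubscriberCount": "8,000",
    "created_at": "2023-11-07T19:46:21.152Z",
    "first_post_date": "2023-12-28T15:19:46.188Z",
}

subscribers = parse_subscriber_count(record["freeSubscriberCount"])
days_to_first_post = (
    parse_timestamp(record["first_post_date"]) - parse_timestamp(record["created_at"])
).days
```

Keeping the parsing in small helpers like these makes it easy to return `None` for unexpected formats instead of failing an entire import.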
Example:
Substack Publications Scraper/
├── src/
│   ├── main.py
│   ├── client/
│   │   ├── substack_api.py
│   │   └── http_session.py
│   ├── extractors/
│   │   ├── publications_parser.py
│   │   └── ranking_utils.py
│   ├── pipelines/
│   │   ├── normalizer.py
│   │   └── deduplicator.py
│   ├── storage/
│   │   ├── dataset_writer.py
│   │   └── json_exporter.py
│   └── config/
│       └── settings.example.json
├── data/
│   ├── input.example.json
│   └── sample-output.json
├── tests/
│   ├── test_extractor.py
│   ├── test_normalizer.py
│   └── test_end_to_end.py
├── docs/
│   └── schema.md
├── requirements.txt
├── pyproject.toml
└── README.md
- Content marketers use it to map out relevant Substack newsletters in their niche, so they can identify outreach targets and partnership opportunities based on audience size and positioning.
- Newsletter founders use it to benchmark competing publications, so they can refine their own positioning, pricing, and content strategy.
- Media analysts use it to track newsletter growth and category trends, so they can build internal reports and dashboards for stakeholders.
- Investor and research teams use it to monitor the independent media ecosystem, so they can spot emerging creators and high-signal niches early.
- Product and platform teams use it to enrich internal knowledge graphs with structured publication data, so they can power recommendation engines and discovery features.
Q: What inputs do I need to provide to run this scraper?
You primarily need to supply an array of keywords that describe the publications you want to find, plus an optional maxItems parameter to control how many publication records are collected. For most workflows, a handful of well-chosen keywords is enough to build a rich dataset.
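The input described above could be assembled along these lines. This is a sketch only: `maxItems` is the parameter named in this README, while the `keywords` key name, the default of 50, and the `build_input` helper are assumptions made for illustration. Check the project's input example file for the actual schema.

```python
from typing import Any, Dict, List


def build_input(keywords: List[str], max_items: int = 50) -> Dict[str, Any]:
    """Assemble an input payload for a run.

    The "keywords" key name and the default of 50 are assumptions;
    only "maxItems" is named in the README.
    """
    # Drop empty or whitespace-only keywords and trim the rest.
    cleaned = [k.strip() for k in keywords if k and k.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty keyword is required")
    if max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    return {"keywords": cleaned, "maxItems": max_items}


payload = build_input(["climate", " fintech "], max_items=100)
```

Serializing `payload` with `json.dumps` gives a document that can be saved as the run input.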
Q: What format is the output provided in?
The scraper outputs structured JSON records similar to the example above. You can easily convert this JSON into CSV, Excel, database tables, or feed it directly into analytics pipelines and business intelligence tools.
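One possible way to do the JSON-to-CSV conversion with only the Python standard library is sketched below. The column selection and the `records_to_csv` helper are illustrative choices, not part of the scraper. Nested fields such as `theme` and `contributors` are left out here because they do not map cleanly onto flat columns.

```python
import csv
import io
from typing import Any, Dict, Iterable, Sequence

# Top-level fields picked for illustration; all appear in the field table.
COLUMNS = (
    "keyword", "name", "subdomain", "base_url",
    "language", "author_name", "freeSubscriberCount",
)


def records_to_csv(records: Iterable[Dict[str, Any]],
                   columns: Sequence[str] = COLUMNS) -> str:
    """Render selected top-level fields as CSV text.

    Fields not listed in `columns` are ignored; missing fields become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(columns), extrasaction="ignore", restval=""
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


csv_text = records_to_csv([
    {"keyword": "sale", "name": "From Somewhere with Anna Sale",
     "theme": {"home_posts": "grid"}},
])
```

To write a file instead, pass the returned text to `open(path, "w", newline="")`, or adapt the function to accept a file object directly.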
Q: Does this scraper handle multilingual publications?
Yes. Each record includes a language field and may also contain rankingDetailByLanguage, which provides localized ranking and subscriber descriptions in several languages, making it easier to analyze non-English markets.
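The lookup below sketches how a localized ranking description might be read with a fallback. The internal shape of `rankingDetailByLanguage` is not documented in this README, so the sketch assumes it maps language codes to either a string or an object carrying `rankingDetail`/`rankingDetailFreeIncluded` keys. Treat that shape, and the `localized_ranking` helper, as assumptions to verify against real output.

```python
from typing import Any, Dict, Optional


def localized_ranking(record: Dict[str, Any], lang: str,
                      fallback: str = "en") -> Optional[str]:
    """Return a ranking description for `lang`, falling back when unavailable.

    ASSUMPTION: rankingDetailByLanguage is keyed by language code and each
    value is either a plain string or a dict with ranking keys.
    """
    by_language = record.get("rankingDetailByLanguage") or {}
    entry = by_language.get(lang) or by_language.get(fallback)
    if isinstance(entry, dict):
        return entry.get("rankingDetailFreeIncluded") or entry.get("rankingDetail")
    if isinstance(entry, str):
        return entry
    # No usable localized entry: fall back to the top-level fields.
    return record.get("rankingDetailFreeIncluded") or record.get("rankingDetail")
```

Falling back to the top-level fields means records without any localized data still yield a usable description.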
Q: Can I use this for large-scale market research?
Absolutely. By combining multiple keyword queries and adjusting maxItems, you can build sizable datasets covering many different categories and niches. Just make sure to respect the target platform's terms of use and applicable data policies.
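When several keyword queries are combined, the same publication can appear more than once. The sketch below shows one way to merge such results, keyed on `subdomain`. The `merge_by_subdomain` helper and the `matched_keywords` field are hypothetical additions for illustration, and this is not necessarily how the project's own deduplication step works.

```python
from typing import Any, Dict, Iterable, List


def merge_by_subdomain(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record per subdomain and collect every keyword that matched it."""
    merged: Dict[str, Dict[str, Any]] = {}
    for record in records:
        key = record.get("subdomain")
        if not key:
            continue  # no identifier to deduplicate on, so skip the record
        if key not in merged:
            # Copy the record so the caller's data is not mutated.
            merged[key] = {**record, "matched_keywords": []}
        keyword = record.get("keyword")
        if keyword and keyword not in merged[key]["matched_keywords"]:
            merged[key]["matched_keywords"].append(keyword)
    # Dicts preserve insertion order, so results keep first-seen order.
    return list(merged.values())


combined = merge_by_subdomain([
    {"keyword": "sale", "subdomain": "annasale", "name": "From Somewhere with Anna Sale"},
    {"keyword": "podcast", "subdomain": "annasale", "name": "From Somewhere with Anna Sale"},
])
```

Recording which keywords matched each publication also helps later when judging how relevant a result is to a given niche.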
Primary Metric: On typical connections, the scraper can process dozens of Substack publications per minute while collecting full metadata, including nested theme and ranking information.
Reliability Metric: In test runs across varied keyword sets, over 95% of discovered publications yielded complete and valid JSON records without missing critical fields like name, subdomain, or author_name.
Efficiency Metric: The processing pipeline is designed to normalize and write records incrementally, allowing it to handle thousands of publications with modest CPU and memory usage on standard cloud instances.
Quality Metric: Thanks to detailed metadata extraction (author info, subscriber ranges, language-specific ranking details, and theme configuration), the resulting dataset offers high completeness and precision for newsletter analytics, market research, and competitive intelligence workflows.
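To make the incremental-writing and completeness ideas above concrete, the sketch below streams records to JSON Lines one at a time and counts how many contain the critical fields named earlier (`name`, `subdomain`, `author_name`). The function names and the JSON Lines format are assumptions for illustration, not a description of the project's actual storage code.

```python
import json
from typing import Any, Dict, Iterable, TextIO, Tuple

# Critical fields as listed under the reliability metric.
CRITICAL_FIELDS = ("name", "subdomain", "author_name")


def is_complete(record: Dict[str, Any]) -> bool:
    """True when every critical field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in CRITICAL_FIELDS)


def write_incrementally(records: Iterable[Dict[str, Any]],
                        out: TextIO) -> Tuple[int, int]:
    """Write one JSON object per line as records arrive.

    Returns (total_written, complete_count). Only one record is held in
    memory at a time, so `records` can be a generator of any length.
    """
    total = 0
    complete = 0
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        total += 1
        if is_complete(record):
            complete += 1
    return total, complete
```

Dividing `complete_count` by `total_written` gives a completeness ratio that can be compared against the figure quoted above for your own keyword sets.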