Web service and CLI tool for SEO site audit: crawls a site, runs Lighthouse on every page, and lets you view public reports in the browser. Also outputs to console, JSON, and CSV.
The web report viewer is json-viewer.
Demo:
Russian description below.
Open https://json-viewer.popstas.pro/. The public server allows scanning up to 100 pages at once.
- Crawls the entire site, collects links to pages and documents
 - Does not follow links outside the scanned domain (configurable)
 - Analyses each page with Lighthouse (see below)
 - Analyses the main page text with Mozilla Readability and Yake
 - Finds pages with SSL mixed content
 - Scan a list of URLs, --url-list
 - Set default report fields and filters
 - Scan presets
 - Documents with the extensions doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip are added to the list with a depth of 0
- Does not load images, css, js (configurable)
 - Each site is saved to a file named after the domain in ~/site-audit-seo/
 - Some URLs are ignored (preRequest in src/scrap-site.js)
- Fixed table header and url column
 - Add/remove columns
 - Column presets
 - Field groups by categories
 - Filter presets (e.g. h1_count != 1)
 - Color validation
 - Verbose page details (+ button)
 - Direct URL to the same report with selected fields, filters, and sort
 - Stats for all scanned pages, validation summary
 - Persistent URL to the report when using --upload
 - Switch between last uploaded reports
 - Rescan current report
 
- url
 - mixed_content_url
 - canonical
 - is_canonical
 - previousUrl
 - depth
 - status
 - request_time
 - redirects
 - redirected_from
 - title
 - h1
 - page_date
 - description
 - keywords
 - og_title
 - og_image
 - schema_types
 - h1_count
 - h2_count
 - h3_count
 - h4_count
 - canonical_count
 - google_amp
 - images
 - images_without_alt
 - images_alt_empty
 - images_outer
 - links
 - links_inner
 - links_outer
 - text_ratio_percent
 - dom_size
 - html_size
 - html_size_rendered
 - lighthouse_scores_performance
 - lighthouse_scores_pwa
 - lighthouse_scores_accessibility
 - lighthouse_scores_best-practices
 - lighthouse_scores_seo
 - lighthouse_first-contentful-paint
 - lighthouse_speed-index
 - lighthouse_largest-contentful-paint
 - lighthouse_interactive
 - lighthouse_total-blocking-time
 - lighthouse_cumulative-layout-shift
 - and 150 more lighthouse tests!
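For orientation, each scanned page ends up as one record with these fields in the report. An illustrative sketch of such a record (the values and the object layout are made up for the example, not the exact JSON file format):

{
  url: 'https://example.com/about',
  depth: 1,
  status: 200,
  request_time: 354,
  title: 'About us',
  h1: 'About us',
  h1_count: 1,
  canonical: 'https://example.com/about',
  is_canonical: 1,
  images_without_alt: 2,
  links_inner: 34,
  links_outer: 3,
  html_size: 48211,
  lighthouse_scores_performance: 92,
  lighthouse_scores_seo: 100
}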
 
Requires Docker.
The script will clone the repository to %LocalAppData%\Programs\site-audit-seo and run the service at http://localhost:5302.
curl https://raw.githubusercontent.com/viasite/site-audit-seo/master/install-run.sh | bash
The script will clone the repository to $HOME/.local/share/programs/site-audit-seo and run the service at http://localhost:5302.
The service will be available at http://localhost:5302.
- Backend: 5301
- Frontend: 5302
- Yake: 5303
You can change these in the .env file or in docker-compose.yml.
npm install -g site-audit-seo
npm install -g site-audit-seo --unsafe-perm=true
After installing on Ubuntu, you may need to change the owner of the Chrome directory from root to your user.
Run this (replace $USER with your username, or run it as your user rather than root):
sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"
git clone https://github.com/viasite/site-audit-seo
cd site-audit-seo
git clone https://github.com/viasite/site-audit-seo-viewer data/front
docker-compose pull # to skip the build step
docker-compose up -d
Error details: Invalid file descriptor to ICU data received.
$ site-audit-seo --help
Usage: site-audit-seo -u https://example.com
Options:
  -u --urls <urls>                  Comma separated url list for scan
  -p, --preset <preset>             Table preset (minimal, seo, seo-minimal, headers, parse, lighthouse,
                                    lighthouse-all) (default: "seo")
  -t, --timeout <timeout>           Timeout for page request, in ms (default: 10000)
  -e, --exclude <fields>            Comma separated fields to exclude from results
  -d, --max-depth <depth>           Max scan depth (default: 10)
  -c, --concurrency <threads>       Threads number (default: by cpu cores)
  --lighthouse                      Appends base Lighthouse fields to preset
  --delay <ms>                      Delay between requests (default: 0)
  -f, --fields <json>               Field in format --field 'title=$("title").text()' (default: [])
  --default-filter <defaultFilter>  Default filter when JSON viewed, example: depth>1
  --no-skip-static                  Scan static files
  --no-limit-domain                 Scan not only current domain
  --docs-extensions <ext>           Comma-separated extensions that will be add to table (default:
                                    doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip)
  --follow-xml-sitemap              Follow sitemap.xml (default: false)
  --ignore-robots-txt               Ignore disallowed in robots.txt (default: false)
  --url-list                        assume that --url contains url list, will set -d 1 --no-limit-domain
                                    --ignore-robots-txt (default: false)
  --remove-selectors <selectors>    CSS selectors for remove before screenshot, comma separated (default:
                                    ".matter-after,#matter-1,[data-slug]")
  -m, --max-requests <num>          Limit max pages scan (default: 0)
  --influxdb-max-send <num>         Limit send to InfluxDB (default: 5)
  --no-headless                     Show browser GUI while scan
  --remove-csv                      Delete csv after json generate (default: true)
  --remove-json                     Delete json after serve (default: true)
  --no-remove-csv                   No delete csv after generate
  --no-remove-json                  No delete json after serve
  --out-dir <dir>                   Output directory (default: "~/site-audit-seo/")
  --out-name <name>                 Output file name, default: domain
  --csv <path>                      Skip scan, only convert existing csv to json
  --json                            Save as JSON (default: true)
  --no-json                         No save as JSON
  --upload                          Upload JSON to public web (default: false)
  --no-color                        No console colors
  --partial-report <partialReport>
  --lang <lang>                     Language (en, ru, default: system language)
  --no-console-validate             Don't output validate messages in console
  --disable-plugins <plugins>       Comma-separated plugin list (default: [])
  --screenshot                      Save page screenshot (default: false)
  -V, --version                     output the version number
  -h, --help                        display help for command
site-audit-seo -d 1 -u https://example -f 'title=$("title").text()' -f 'h1=$("h1").text()'
site-audit-seo -d 1 -u https://example -f noindex=$('meta[content="noindex,%20nofollow"]').length
site-audit-seo -d 1 -u https://example -f title=$('title').text() -f h1=$('h1').text()
This will output fields from the seo preset excluding the canonical fields:
site-audit-seo -u https://example.com --exclude canonical,is_canonical
site-audit-seo -u https://example.com --preset lighthouse
site-audit-seo -u https://example.com --lighthouse
You can copy .site-audit-seo.conf.js to your home directory and tune the options.
This is a beta feature. How to configure it:
- Add this to ~/.site-audit-seo.conf:
module.exports = {
  influxdb: {
    host: 'influxdb.host',
    port: 8086,
    database: 'telegraf',
    measurement: 'site_audit_seo', // optional
    username: 'user',
    password: 'password',
    maxSendCount: 5, // optional, default send part of pages
  }
};
- Use --influxdb-max-send in the terminal.
- Create a command to scan your URLs:
 
site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log
- Add command to cron.
 
- Readability - main page text length, reading time
 - Yake - keyword extraction from the main page text
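As a rough illustration of what the Readability plugin computes (this is not the plugin's actual code; it assumes jsdom and @mozilla/readability are installed, the field names are hypothetical, and reading time assumes roughly 200 words per minute):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

function readabilityStats(html, url) {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse(); // extract the main article text
  const words = article.textContent.trim().split(/\s+/).length;
  return {
    readability_text_length: article.length,         // characters of extracted text
    readability_reading_time: Math.ceil(words / 200) // minutes, assuming ~200 wpm
  };
}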
 
See CONTRIBUTING.md for details about plugin development.
cd data
npm install site-audit-seo-readability
npm install site-audit-seo-yake
You can pass an argument such as --disable-plugins readability,yake. Scanning is faster, but less data is extracted.
Based on headless-chrome-crawler (Puppeteer). The forked version @popstas/headless-chrome-crawler is used.
- Sometimes it writes identical pages to the CSV. This happens in two cases:
  1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded; see the sketch below).
  2. Simultaneous requests for the same page in parallel threads.
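For context, a minimal sketch of how the underlying crawler exposes these options (this is not the project's actual src/scrap-site.js; the values are only examples):

const HCCrawler = require('@popstas/headless-chrome-crawler');

HCCrawler.launch({
  maxDepth: 10,                 // like --max-depth
  skipRequestedRedirect: true,  // skip URLs already seen in redirect chains
  obeyRobotsTxt: true,          // disabled by --ignore-robots-txt
  onSuccess: result => console.log(result.options.url, result.response.status),
}).then(async crawler => {
  crawler.queue('https://example.com');
  await crawler.onIdle();
  await crawler.close();
});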
- WebSite Auditor (Link Assistant) - desktop app, 500 pages
 - Screaming Frog SEO Spider - desktop app, same as site-audit-seo, 500 pages
 - Seobility - 1 project up to 1000 pages free
 - Neilpatel (Ubersuggest) - 1 project, 150 pages
 - Semrush - 1 project, 100 pages per month free
 - Seoptimer - good for single page analysis
 
- Web Scraper - browser extension, free for local use
 - Portia - self-hosted visual scraper builder, scrapy based
 - Crawlab - distributed web crawler admin platform, self-hosted with Docker
 - OutWit Hub - free edition, pro edition for $99
 - Octoparse - 10 000 records free
 - Parsers.me - 1 000 pages per run free
 - website-scraper - opensource, CLI, download site to local directory
 - website-scraper-puppeteer - same but puppeteer based
 - Gerapy - distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
 - DAXRM Rank Tracker - Real-time Google & Bing SEO Rank tracking
 
Scans one or several sites into a JSON file, with a web interface.
- Crawls the entire site, collects links to pages and documents
 - Summary of results after scanning
 - Documents with the extensions doc,docx,xls,xlsx,pdf,rar,zip are added to the list with a depth of 0
 - Search for pages with SSL mixed content
 - Each site is saved to a file named after the domain
 - Does not follow links outside the scanned domain (configurable)
 - Does not load images, css, js (configurable)
 - Some URLs are ignored (preRequest in src/scrap-site.js)
 - Each page can be run through Lighthouse (see below)
 - Scan an arbitrary list of URLs, --url-list
npm install -g site-audit-seo
npm install -g site-audit-seo --unsafe-perm=true
npm run postinstall-puppeteer-fix
Or run this (replace $USER with your username, or run it as your user rather than root):
sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"
Error details: Invalid file descriptor to ICU data received.
site-audit-seo -u https://example.com
You can pass additional fields like this:
site-audit-seo -d 1 -u https://example -f "title=$('title').text()" -f "h1=$('h1').text()"site-audit-seo -u https://example.com --preset lighthousesite-audit-seo -u https://example.com --lighthouse- Открыть в блокноте
 - Документы посчитать поиском 
,0 - Листалки исключить поиском 
? - Вычесть 1 (шапка)
 
- Sometimes it writes identical pages to the CSV. This happens in two cases:
  1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, done).
  2. Simultaneous requests for the same page in parallel threads.
- Unique links
 - Offline w3c validation
 - Words count
 - Sentences count
 - Do not load images with non-standard URLs, like this
 - External follow links
 - Broken images
 - Breadcrumbs - https://github.com/glitchdigital/structured-data-testing-tool
 - joeyguerra/schema.js - https://gist.github.com/joeyguerra/7740007
 - smhg/microdata-js - https://github.com/smhg/microdata-js
 - indicate page scan error
 - Find broken encoding like регионального