This repository contains a Python script that reads a sitemap and extracts its URLs.
It needed some custom code because the NAU STAGE environment is protected by basic authentication, which prevents web search engines from indexing its data.
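
As a rough illustration of the idea, the sketch below downloads a sitemap, optionally sending HTTP basic authentication credentials, and collects the text of every `<loc>` element. It uses only the Python standard library. The function names (`basic_auth_header`, `extract_urls`, `fetch_sitemap`) are hypothetical and are not taken from `export.py`, which may work differently.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element, whatever its XML namespace.

    This also matches <loc> entries of a sitemap index and of sitemap
    extensions (for example image sitemaps), since only the local tag
    name is compared.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]


def fetch_sitemap(url, user=None, password=None):
    """Download a sitemap, sending credentials only when both are given."""
    request = urllib.request.Request(url)
    if user is not None and password is not None:
        request.add_header("Authorization", basic_auth_header(user, password))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With these helpers, `extract_urls(fetch_sitemap(url, user, password))` would return the list of URLs found in a single sitemap file. A sitemap index such as `/sitemap_index.xml` lists further sitemaps, so each returned URL would have to be fetched and parsed in turn.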

Create a virtual environment.

    virtualenv venv --python=python3
    . venv/bin/activate

Install the package requirements in the virtual environment.

    pip install -r requirements.txt

| Parameter | Required | Description |
|---|---|---|
| url | True | URL of the sitemap to read |
| --user | False | User name for basic authentication |
| --pass | False | Password for basic authentication |
| --remove_host | False | If passed, removes the protocol and hostname from the output |
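
Removing the protocol and hostname is what makes the output of two environments comparable, since the same page lives under a different host on each one. A minimal sketch of that step, assuming it is done with `urllib.parse` (the helper name `remove_host` is hypothetical and the actual script may implement it differently):

```python
from urllib.parse import urlsplit


def remove_host(url):
    """Drop the scheme and host, keeping the path, query, and fragment."""
    parts = urlsplit(url)
    result = parts.path or "/"  # a bare host maps to the site root
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result
```

For example, `remove_host("https://www.nau.edu.pt/sitemap_index.xml")` gives `/sitemap_index.xml`.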

For WordPress the sitemap is located at /sitemap_index.xml, but on Richie it is located at /sitemap.xml. Example:

Export the STAGE environment, which runs Richie:

    python export.py https://www.stage.nau.fccn.pt/sitemap.xml --user <USER> --password <PASSWORD> --remove_host true > stage.txt

Export the PROD environment, which runs WordPress:

    python export.py https://www.nau.edu.pt/sitemap_index.xml --remove_host true > prod.txt

Then you can use a comparison program, such as diff or meld, to compare both files.
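
If the two exports list the URLs in a different order, a line-based diff can be noisy. As an alternative to diff or meld, the following sketch compares the files as sets of paths. It assumes one URL per line in `stage.txt` and `prod.txt`, and the function names are hypothetical.

```python
def compare_url_lists(stage_lines, prod_lines):
    """Return (only_in_stage, only_in_prod) as sorted lists, ignoring blank lines."""
    stage = {line.strip() for line in stage_lines if line.strip()}
    prod = {line.strip() for line in prod_lines if line.strip()}
    return sorted(stage - prod), sorted(prod - stage)


def compare_files(stage_path="stage.txt", prod_path="prod.txt"):
    """Read both export files and print the paths missing from each side."""
    with open(stage_path) as stage_file, open(prod_path) as prod_file:
        only_stage, only_prod = compare_url_lists(stage_file, prod_file)
    for path in only_stage:
        print(f"only in STAGE: {path}")
    for path in only_prod:
        print(f"only in PROD:  {path}")
```

Calling `compare_files()` from the directory that holds both exports prints every path that exists in only one environment.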