sitemap_gen is a Python program that crawls a web site and
outputs an XML sitemap.
It was created by Vladimir Toncar and is documented at http://toncar.cz/opensource/sitemap_gen.html.
This version is a port of his fine program to Python 3.
Requirements
- Python 3.x
- reppy (https://github.com/seomoz/reppy); install it with:
    pip3 install reppy
- requests (https://requests.readthedocs.io/); it is installed automatically as a dependency of reppy.
Example
    python3 sitemap_gen.py -b doc -b bmp -o test_sitemap.xml http://www.your-site-name.com/index.html
Command Line Arguments
    python3 sitemap_gen.py <options> <starting URL>
Available options:
-h, --help
    Print this text and exit.
-b <ext>, --block <ext>
    Exclude URLs with the given extension; <ext> must be given without
    the leading dot. The comparison is case insensitive, so for example
    DOC and doc are treated the same. You can use this option several
    times to block several extensions.
-c <value>, --changefreq <value>
    Set the change frequency. The given value is used in all sitemap
    entries (a future version of this script may change that). The
    allowed values are: always, hourly, daily, weekly, monthly, yearly,
    never.
-p <prio>, --priority <prio>
    Set the priority. The value must be between 0.0 and 1.0. It is used
    in all sitemap entries.
-m <value>, --max-urls <value>
    Set the maximum number of URLs to be crawled. The default value is
    1000 and the largest value that you can set is 50000 (the script
    generates only a single sitemap file).
-r <value>, --ratelimit <value>
    Set a crawl rate limit in requests per second. Zero (the default)
    means no crawl rate limitation.
-o <file>, --output-file <file>
    Set the name of the generated sitemap file. The default file name is
    sitemap.xml.
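The constraints described for the options can be illustrated with a short Python sketch. This is not taken from sitemap_gen.py; the helper names (is_blocked, check_options, min_delay) are hypothetical, and the sketch assumes that the extension is read from the URL path, that the priority bounds are inclusive, and that at least one URL must be crawled.

```python
import posixpath
from urllib.parse import urlparse

# Values listed for -c / --changefreq.
ALLOWED_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                      "monthly", "yearly", "never"}
# Largest value accepted by -m / --max-urls.
MAX_URLS_LIMIT = 50000


def is_blocked(url, blocked_exts):
    """Return True if the URL path ends in a blocked extension.

    Extensions are given without the leading dot and compared
    case-insensitively, as described for -b / --block.
    Assumption: only the path part of the URL is inspected.
    """
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext != "" and ext in {e.lower() for e in blocked_exts}


def check_options(changefreq=None, priority=None, max_urls=1000, ratelimit=0.0):
    """Return a list of error messages; an empty list means the options are valid."""
    errors = []
    if changefreq is not None and changefreq not in ALLOWED_CHANGEFREQ:
        errors.append("changefreq must be one of: "
                      + ", ".join(sorted(ALLOWED_CHANGEFREQ)))
    # Assumption: both ends of the priority interval are allowed.
    if priority is not None and not 0.0 <= priority <= 1.0:
        errors.append("priority must be between 0.0 and 1.0")
    # Assumption: the lower bound of 1 is not stated in the option text.
    if not 1 <= max_urls <= MAX_URLS_LIMIT:
        errors.append("max-urls must be between 1 and %d" % MAX_URLS_LIMIT)
    if ratelimit < 0:
        errors.append("ratelimit must not be negative")
    return errors


def min_delay(ratelimit):
    """Seconds to wait between requests; zero means no rate limiting."""
    return 0.0 if ratelimit == 0 else 1.0 / ratelimit


if __name__ == "__main__":
    print(is_blocked("http://www.your-site-name.com/files/report.DOC", ["doc", "bmp"]))
    print(check_options(changefreq="weekly", priority=0.5, max_urls=1000))
    print(min_delay(2))
```

With the options from the example command (-b doc -b bmp), a URL ending in report.DOC would be skipped under these assumptions, while index.html would be crawled.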