One of the problems I ran into when I started my SEO adventures was generating a sitemap for my forum. I used Google's sitemap_gen for my sites, but the forum generated a lot of URLs for the same page/content, which of course is not a good thing.
So I searched for a solution and found one, though I had to make some modifications to Google's sitemap_gen for things to work correctly.
Long story short, here's how to do it:

  • download and install linklint from here
  • download my modified sitemap_gen from here
  • create a sitemap_gen config file for your site/forum, based on access logs (if you host multiple sites, configure your HTTP server to log each site to its own log file). If you don't modify the script below, the name should be of the form "config-www.domain.tld.xml"
  • add a crontab entry to logrotate your HTTP server access logs at an off-peak time (probably sometime at night). This makes sure that sitemap_gen does not process the same URL many times. Also, a big log file is processed more slowly than a small one ;)
  • add another crontab entry to run this script about 5 minutes after the logrotate (update paths and config name accordingly)
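To illustrate the duplicate-URL problem that the steps above work around, here is a minimal Python sketch. It assumes the duplicates come from a session parameter in the query string (`sid` is what phpBB appends); the function names are hypothetical, and this is not how the modified sitemap_gen handles it, only a way to see why one page can show up under many URLs in an access log.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: "sid" is the session parameter the forum appends to its links.
SESSION_PARAMS = {"sid"}


def canonical_url(url):
    """Return url with session parameters and any fragment removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


def unique_urls(urls):
    """Keep the first occurrence of each canonical URL, preserving order."""
    seen = set()
    result = []
    for url in urls:
        canon = canonical_url(url)
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result
```

Feeding it two log entries for the same topic that differ only in `sid` yields a single URL, which is the behaviour you want from the sitemap.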
Here is a small example to make things clearer (based on my domain).
I assume the following:
  1. linklint is reachable from the path (default install)
  2. sitemap_gen is installed in /www/sitemap-gen/sitemap_gen-1.4/
  3. config path and file name is /www/sitemap-gen/sitemap_gen-1.4/
  4. script is in the sitemap_gen directory (/www/sitemap-gen/sitemap_gen-1.4/)
  5. httpd (apache) is installed in /etc/httpd
  6. httpd logs are in /etc/httpd/logs
  7. httpd access logs for site are generated in /etc/httpd/logs/
Example config file for a phpBB 2 forum (right click and download), located in the web directory /forum/ on my site:
Example crontab setup:
0 1 * * * /root/
10 1 * * * /www/sitemap-gen/sitemap_gen-1.4/
Example script download
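If you want a feel for what such a wrapper does before downloading it, here is a rough Python sketch. It is not the downloadable script: it only builds the config file name in the "config-www.domain.tld.xml" form and the sitemap_gen command line, using the install path assumed above. It assumes the modified sitemap_gen keeps the stock `sitemap_gen.py --config=...` invocation, and it leaves linklint out entirely. The function names and the `www.domain.tld` host are placeholders.

```python
import subprocess

# Install path taken from the assumptions listed above.
SITEMAP_GEN_DIR = "/www/sitemap-gen/sitemap_gen-1.4/"


def config_name(host):
    """Config file name in the expected form, e.g. config-www.domain.tld.xml."""
    return "config-%s.xml" % host


def build_command(host, python="python"):
    """Command line that runs sitemap_gen.py against the site's config file."""
    return [
        python,
        SITEMAP_GEN_DIR + "sitemap_gen.py",
        "--config=" + SITEMAP_GEN_DIR + config_name(host),
    ]


def run(host):
    """Run sitemap_gen for host and return its exit status."""
    return subprocess.call(build_command(host))
```

Calling `run("www.domain.tld")` from the cron job would then regenerate the sitemap for that host; swap in your own host name and paths.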
Example httpd logrotate config file download (this needs to be in /etc/logrotate.d/)