lists.arthurdejong.org

Possible Webcheck bug?

Hello again

Before submitting a bug report, I'll just ask in case you're already aware.

So far Webcheck hasn't failed me, even on a positively big site, but now I have found a site that gives me problems. The crawler goes through the <baseurl>/dirname/ structure fine, but then starts again with <baseurl>//dirname/ and after that went on to <baseurl>////dirname/, at which point I killed it. Notice the additional slashes between the baseurl and the rest.
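To make the pattern concrete, here is a minimal sketch of how the duplicated URLs could be folded back onto the originals after the crawl. It is not Webcheck's own code or behaviour; the function name collapse_slashes is made up for illustration, and it assumes the extra slashes carry no meaning on this site.

```shell
# Collapse runs of slashes in a URL down to a single slash, leaving the
# "//" that follows the scheme (http:) untouched.
# Assumption: repeated slashes are never significant, which also means a
# "//" inside a query string would be collapsed.
collapse_slashes() {
    printf '%s\n' "$1" | sed 's#\([^:/]\)//*#\1/#g'
}

collapse_slashes 'http://carendt.us////dirname/'
```

With this, <baseurl>//dirname/ and <baseurl>////dirname/ both map back to <baseurl>/dirname/, so duplicates can be dropped with sort -u.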

Admittedly the site isn't mine, but the owner is currently seriously ill and in hospice care, and he is not expected to survive much longer. The gentleman has built up a considerable wealth of information in his field of expertise and it would be a real shame if it were lost. Hence my idea to use Webcheck to get a list of URLs to fetch, then strip out all non-baseurl links and set up a loop in bash to get each file.
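The plan would look roughly like the sketch below. The helper name filter_base and the input file urls.txt (one URL per line, however it ends up being extracted from Webcheck's output) are assumptions for illustration. The fetch loop is left commented out so the sketch has no network side effects.

```shell
# Keep only the URLs that start with the given base URL.
# Reads one URL per line on stdin, writes the matches to stdout.
filter_base() {
    base=$1
    while IFS= read -r url; do
        case $url in
            "$base"*) printf '%s\n' "$url" ;;
        esac
    done
}

# Fetch loop, assuming a hypothetical urls.txt holding the crawled URLs:
# filter_base 'http://www.carendt.us/' < urls.txt | sort -u |
# while IFS= read -r url; do
#     wget --force-directories "$url"
# done
```

wget's --force-directories option recreates the site's directory layout locally, so files with the same name in different directories do not overwrite each other.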

The following links have been tested, with the same results, pointing to the same site:
http://carendt.us/
http://www.carendt.us/
http://www.carendt.com/

Webcheck version: 1.10.4 on Debian Lenny. I have the webcheck.dat files (uncompressed), but not from a run in debug mode, I'm afraid...

Is it possible to redirect the screen output of Webcheck to a text file with the >> operator? I think so, but I'm not sure.
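For reference, this is how the shell itself treats >>, shown with a stand-in function rather than Webcheck. Which stream Webcheck actually writes its messages to is not something assumed here, so the sketch captures both.

```shell
# Stand-in for any command that prints to both stdout and stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

log=$(mktemp)

# ">>" appends stdout to the file instead of overwriting it;
# "2>&1" placed after it sends stderr to the same file.
noisy >> "$log" 2>&1
noisy >> "$log" 2>&1

cat "$log"
```

Without the 2>&1 part, anything written to stderr would still show up on the screen and not in the file.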

If you need more info let me know, I'll get it to you ASAP (but it may take a while as I have to work over the weekend...)

Regards, Vincent Wesstein
the Netherlands