Re: Possible Webcheck bug?

On Sat, 2011-03-05 at 04:15 +0100, m.v.wesstein wrote:
> So far Webcheck hasn't failed me, even on a positively big site, but now
> I've found a site that gives me problems. The crawler goes through the
> <baseurl>/dirname/ structure fine, but then starts again with
> <baseurl>//dirname/ and after that went on to <baseurl>////dirname/
> before I killed it. Notice the additional slashes between the baseurl
> and the rest.

Actually,
  <baseurl>/dirname/
and
  <baseurl>//dirname/
are two distinct, valid URLs, so if the website links to both forms the
crawl can keep discovering "new" URLs for quite a while.

You could add a RewriteRule to your Apache configuration to redirect
URLs with double slashes to their single-slash equivalents. Webcheck
currently has no facility for that.
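
A minimal, untested sketch (server or virtual-host context, assuming
mod_rewrite is enabled):

  RewriteEngine On
  # Redirect any URL whose path contains '//'; the external redirect
  # repeats until the path contains no more double slashes.
  RewriteCond %{REQUEST_URI} ^(.*)//+(.*)$
  RewriteRule . %1/%2 [R=301,L]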

> Hence my idea to use Webcheck to get me a list of URLs to fetch, then
> strip out all non-baseurl links and have a loop set up in bash to get
> each file.

If you want a mirror you could also use wget with the --mirror option;
it crawls the website and downloads everything, and I don't think wget
has a problem with the double slashes.
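
For example, with http://www.example.com/ standing in for the actual
site:

  wget --mirror --no-parent --convert-links http://www.example.com/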

> Is it possible to redirect the screen output of Webcheck to a text file
> with the >> operator? I think so, but I'm not sure.

If you just want a list of URLs the easiest approach is to parse the
webcheck.dat file. To extract all crawled URLs with sed:
  sed -n 's/^\[\(.*\)\]/\1/p' webcheck.dat | sort -u
(the sort -u is needed because URLs may appear more than once in
webcheck.dat)
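
Building on that, a rough sketch of the fetch loop you described, with
http://www.example.com/ as a stand-in for the base URL:

  sed -n 's/^\[\(.*\)\]/\1/p' webcheck.dat | sort -u |
    grep '^http://www\.example\.com/' |
    while read -r url; do
      # -x recreates the site's directory structure locally
      wget -x "$url"
    done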

-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --