Re: Possible Webcheck bug?
- From: Arthur de Jong <arthur [at] arthurdejong.org>
- To: webcheck-users [at] lists.arthurdejong.org
- Subject: Re: Possible Webcheck bug?
- Date: Sun, 06 Mar 2011 09:56:58 +0100
On Sat, 2011-03-05 at 04:15 +0100, m.v.wesstein wrote:
> So far Webcheck hasn't failed me, even on a positively big site, but now
> I found a site that gives me problems. The crawler goes through the
> <baseurl>/dirname/ structure fine, but then starts again with
> <baseurl>//dirname/ and after that, goes on to <baseurl>////dirname/
> before I killed it. Notice the additional slashes between the baseurl
> and the rest.
Actually, <baseurl>/dirname/ and <baseurl>//dirname/ are two distinct,
valid URLs, so if the website links to both forms the crawl can go on
for quite a while: each extra slash yields yet another distinct URL,
which is why you then saw <baseurl>////dirname/.
You could add a RewriteRule to the site's Apache configuration to
redirect URLs containing double slashes to the single-slash equivalent.
Currently, webcheck does not have a facility for that.
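Something along these lines in the server or virtual host configuration
should do it (an untested sketch, assuming mod_rewrite is enabled):

RewriteEngine On
# if the requested path still contains a double slash ...
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
# ... redirect to the collapsed form; the rule fires again on the
# redirected request until no double slash is left
RewriteRule . %1/%2 [R=301,L]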
> Hence my idea to use Webcheck to get me a list of URLs to fetch, then
> strip out all non-baseurl links and have a loop set up in bash to get
> each file.
If you want a mirror you could also use wget with the --mirror option.
It should also crawl the website and copy everything. I don't think wget
has a problem with the double slashes.
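For example (with example.com standing in for the real site):

wget --mirror --no-parent http://www.example.com/

--mirror enables recursion and timestamping, and --no-parent keeps wget
from wandering above the starting directory.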
> Is it possible to redirect the screen output of Webcheck to a text
> file with the >> operator? I think so, but I'm not sure.
If you just want a list of URLs, the easiest approach is to parse the
webcheck.dat file. To get all crawled URLs with sed you can do:
sed -n 's/^\[\(.*\)\]/\1/p' webcheck.dat | sort -u
(the sort -u is needed because URLs may appear more than once in
webcheck.dat)
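Building on that, the fetch loop you describe could look something like
this (a rough sketch; the base URL is a placeholder):

base='http://www.example.com/'
sed -n 's/^\[\(.*\)\]/\1/p' webcheck.dat | sort -u |
while read -r url; do
    # keep only links under the base URL and fetch each one,
    # preserving the directory structure (-x)
    case "$url" in
        "$base"*) wget -x "$url" ;;
    esac
done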
--
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --
--
To unsubscribe send an email to
webcheck-users-unsubscribe@lists.arthurdejong.org or see
http://lists.arthurdejong.org/webcheck-users