lists.arthurdejong.org

Re: handling data urls

On Fri, 2013-10-18 at 13:45 -0400, Moses Moore wrote:
> (first attempt at sending message timed-out trying to connect to
> bobo.arthurdejong.org[2001:888:1613::1]:25: Connection timed out)
> (second attempt at sending message failed  "Client host rejected:
> cannot find your hostname")

The first indicates that the sending mail server is using IPv6 and not
falling back to IPv4 (or has a very short queue lifetime). The second
means that mail server has no reverse DNS record for its IP address,
which many mail servers require.

Anyway,

> I noticed my webcheck (v1.10.4) reports were running into the hundreds
> of megabytes of output.  I found out it's because webcheck is still
> processing and reporting on 'data:*' URLs despite using
> `--yank='data:'` on the command line.
> 
> If I'm going to keep using webcheck, it seems I'll need to modify the
> software not to include entire data:* URLs in the reports. It's
> not like data:* URLs need to be checked whether it returns a "404 file
> not found".  It's not as if the entire text of a data:* URL needs
> to be written in a report.
> 
> Is there something stronger than '--yank',
> or will I need to write a patch to handle data:* in a special way?

I'm afraid nothing stronger than --yank is currently implemented. The
only thing --yank ensures is that the URL is not retrieved; the URL is
still recorded and so still ends up in the reports.

A patch would be very welcome to completely ignore URLs matching certain
patterns.
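
As a rough idea of what such a patch could do, here is a minimal sketch
in Python. It is not webcheck code: IGNORE_PATTERNS, is_ignored() and
filter_links() are hypothetical names, and where such a filter would
hook into the crawler is left open. The point is only that matching
URLs are dropped before they are recorded anywhere.

```python
import re

# Hypothetical list of patterns for URLs that should be ignored entirely.
# This is an assumption for illustration, not an existing webcheck option.
IGNORE_PATTERNS = [re.compile(r'^data:', re.IGNORECASE)]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True if the URL matches any of the ignore patterns."""
    return any(pattern.search(url) for pattern in patterns)


def filter_links(urls, patterns=IGNORE_PATTERNS):
    """Return only the URLs that should be recorded and checked."""
    return [url for url in urls if not is_ignored(url, patterns)]
```

Applied to the links found on a page, filter_links() would leave out
any data: URL, so it would be neither fetched nor written to a report.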

webcheck development was recently picked up again and the Git version
is now almost ready for an initial release. A patch against that
version would be very welcome.

Thanks for your email,

-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org/ --
-- 
To unsubscribe send an email to
webcheck-users-unsubscribe@lists.arthurdejong.org or see
http://lists.arthurdejong.org/webcheck-users/