lists.arthurdejong.org

Re: webcheck max depth patch

On 2011-11-04, at 10:13, Arthur de Jong wrote:

> On Wed, 2011-11-02 at 16:42 +0100, Devin Bayer wrote: 
>> The following patch, against SVN trunk, does two things:
> 
> Thanks for the patch.

You're welcome. I find webcheck quite useful and wish more site authors would 
use it too :)

I think a few more patches may come from me soon.

>> 1. Fixes encoding issues with the legacy HTMLParser and SQL
> 
> I don't really understand what you're doing here:
> 
> -        parser.feed(content)
> +        parser.feed(content.decode('ascii', errors='ignore').encode())
> 
> It seems that you are converting from ASCII to the local encoding. The
> encoding should already be used in most places and internally webcheck
> should use unicode strings as much as possible.

What happens is:

1. decode content, treating it as ascii, but ignoring errors. Returns a 
unicode()
2. encode that unicode as ascii, returning a str()

So, basically, if the content was not valid in the local encoding, it now is, 
because we re-encoded it and discarded the invalid characters. I could send you 
the URLs that require this; I think the problem occurred when tag names 
contained non-ASCII characters.
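
The two steps above can be sketched as follows. This is a minimal Python 3 
illustration, not the patch itself (which targets Python 2); the helper name 
strip_non_ascii and the sample input are assumptions made for the example.

```python
def strip_non_ascii(content: bytes) -> bytes:
    """Drop every byte that is not valid ASCII (hypothetical helper)."""
    # Step 1: decode as ASCII, silently skipping bytes that are not ASCII.
    text = content.decode('ascii', errors='ignore')
    # Step 2: encode back to bytes; the result is now pure ASCII.
    return text.encode('ascii')


# Made-up sample: UTF-8 encoded "é" inside a tag name and in the body.
sample = b'<t\xc3\xa9st>caf\xc3\xa9</t\xc3\xa9st>'
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Note that this is lossy: the non-ASCII characters are removed, not 
transliterated, so the parser sees shortened tag names and text.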

> Also, this change:
> 
> -        self.pageproblems.append(PageProblem(message=message))
> +        self.pageproblems.append(PageProblem(message=message.decode(errors='replace')))
> 
> I think SQLAlchemy should already handle both strings and unicode
> objects transparently.

You would think so, but SQLAlchemy's error message was very verbose and 
explained you should not pass it 8-bit strings, only ASCII or unicode objects.
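
As a rough illustration of the decode-with-replace idea in the quoted diff, 
here is a minimal Python 3 sketch. The helper name to_text is hypothetical, and 
the explicit 'ascii' codec is an assumption meant to mirror Python 2, where 
str.decode() normally falls back to ASCII. It does not touch SQLAlchemy; it 
only shows how a byte string becomes a unicode object before being stored.

```python
def to_text(message) -> str:
    """Return message as text, replacing undecodable bytes (hypothetical helper)."""
    if isinstance(message, bytes):
        # Each byte that is not valid ASCII becomes U+FFFD, the replacement
        # character, so the result is always a valid unicode string.
        return message.decode('ascii', errors='replace')
    # Already text: pass it through unchanged.
    return message


print(to_text(b'bad tag: caf\xc3\xa9'))
print(to_text('already text'))
```

Unlike errors='ignore', errors='replace' keeps a visible marker where a byte 
could not be decoded, which is useful in a problem message shown to the user.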

Cheers,
Devin