lists.arthurdejong.org

Re: Warning while parsing

On Tue, 2011-11-08 at 16:03 +0100, Jaroslav Lhotak wrote: 
> Firstly I’d like to thank for your tool – I started to experimentally
> use it on our sites (www.svobodanews.ru)

Thanks.

> And here’s what I’m getting pretty often. Is this something what
> should concern me – problem of the page or the way you parse it?
[...] 
> webcheck: Warning: problem parsing page: 'ascii' codec can't decode byte 0xc3 
> in position 44: ordinal not in range(128)
> Traceback (most recent call last):
>   File "/usr/share/webcheck/crawler.py", line 549, in fetch
>     parsermodule.parse(content, self)
>   File "/usr/share/webcheck/parsers/html/__init__.py", line 121, in parse
>     calltidy.parse(content, link)
>   File "/usr/share/webcheck/parsers/html/calltidy.py", line 35, in parse
>     link.add_pageproblem(parsers.html.htmlunescape(unicode(err)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: 
> ordinal not in range(128)

The bug here is that the tidy plugin generates an error message that
cannot be converted to Unicode with the default 'ascii' codec, probably
because it contains accented characters. The error message is related
to the part of the page that has the links to Facebook, Twitter, etc.
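
To illustrate the kind of failure described above, here is a minimal
sketch. It is written for Python 3 (webcheck here runs on Python 2, where
unicode() falls back to the 'ascii' codec), the message bytes are made
up, and to_text() is a hypothetical helper, not the actual change made in
webcheck:

```python
# Hypothetical tidy message as raw bytes; 0xc3 0xa9 is UTF-8 for "é".
raw_message = b'line 12 column 5 - Warning: unknown attribute "caf\xc3\xa9"'


def to_text(message, encoding="utf-8"):
    """Return message as text without raising on undecodable bytes.

    Assumption: the message is UTF-8; bytes that are not valid UTF-8 are
    replaced with U+FFFD instead of aborting the whole page parse.
    """
    if isinstance(message, bytes):
        return message.decode(encoding, errors="replace")
    return str(message)


# Decoding as ASCII fails on the first byte >= 0x80, as in the traceback.
try:
    raw_message.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii decode fails:", exc)

# Decoding leniently keeps the warning readable.
print(to_text(raw_message))
```

The point of the lenient decode is that a badly encoded diagnostic should
end up as a page problem in the report rather than as a traceback.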

The link to my.ya.ru does not properly escape the title of the page,
which causes some Russian text to be parsed as if it were HTML
attributes.
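
On the page side, a rough sketch of what proper escaping of such a share
link could look like. I have not seen how your templates build the link,
so share_link(), its parameters and the example.org URLs are all
assumptions for illustration only:

```python
from html import escape
from urllib.parse import quote


def share_link(base_url, page_url, title):
    """Build an <a> tag for a share link with every value escaped.

    Hypothetical helper: values are percent-encoded for the query string,
    then the whole URL and the title are HTML-escaped for the attributes.
    """
    query = "url=" + quote(page_url, safe="") + "&title=" + quote(title, safe="")
    href = escape(base_url + "?" + query, quote=True)
    return '<a href="%s" title="%s">share</a>' % (href, escape(title, quote=True))


# A title with double quotes would otherwise end the attribute early and
# let the remaining words be read as separate attributes.
print(share_link("http://example.org/share",
                 "http://example.org/article",
                 'Новости "дня"'))
```

With the quotes turned into &quot; the title stays inside its attribute
and tidy has nothing to complain about there.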

I've fixed the tidy issue in the development version:
  http://arthurdejong.org/viewvc/webcheck?revision=460&view=revision
The change should also be usable for the 1.10.4 version.

Thanks, 
-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --