Re: Warning while parsing
- From: Arthur de Jong <arthur [at] arthurdejong.org>
- To: Jaroslav Lhotak <lhotakj [at] rferl.org>
- Cc: webcheck-users [at] lists.arthurdejong.org
- Subject: Re: Warning while parsing
- Date: Tue, 08 Nov 2011 23:00:37 +0100
On Tue, 2011-11-08 at 16:03 +0100, Jaroslav Lhotak wrote:
> Firstly I’d like to thank for your tool – I started to experimentally
> use it on our sites (www.svobodanews.ru)
Thanks.
> And here’s what I’m getting pretty often. Is this something what
> should concern me – problem of the page or the way you parse it?
[...]
> webcheck: Warning: problem parsing page: 'ascii' codec can't decode byte 0xc3
> in position 44: ordinal not in range(128)
> Traceback (most recent call last):
> File "/usr/share/webcheck/crawler.py", line 549, in fetch
> parsermodule.parse(content, self)
> File "/usr/share/webcheck/parsers/html/__init__.py", line 121, in parse
> calltidy.parse(content, link)
> File "/usr/share/webcheck/parsers/html/calltidy.py", line 35, in parse
> link.add_pageproblem(parsers.html.htmlunescape(unicode(err)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44:
> ordinal not in range(128)
The bug here is that the tidy plugin generates an error message that
cannot be converted to Unicode, probably because it contains non-ASCII
(accented) characters. The error message relates to the part of the
page that has the links to Facebook, Twitter, etc.
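To illustrate the failure mode in the traceback: in Python 2, calling
unicode() on a byte string decodes it with the default ASCII codec, so
any byte above 0x7f (such as 0xc3, the first byte of many UTF-8 encoded
accented letters) raises UnicodeDecodeError. The sketch below reproduces
the same error in Python 3 and shows one defensive way to decode such a
message. It is not the change made in webcheck; the sample message and
the to_text() helper are made up for illustration.

```python
# Hypothetical stand-in for a tidy diagnostic that contains UTF-8 bytes.
SAMPLE_TEXT = 'Warning: <a> unknown attribute "café"'
raw_message = SAMPLE_TEXT.encode("utf-8")


def to_text(message, encoding="utf-8"):
    """Return message as text; undecodable bytes become U+FFFD."""
    if isinstance(message, bytes):
        return message.decode(encoding, errors="replace")
    return str(message)


# Decoding as ASCII fails on the first non-ASCII byte, as in the traceback.
try:
    raw_message.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding with an explicit encoding and a lenient error handler succeeds.
decoded = to_text(raw_message)
```

Using errors="replace" means a problem report is still recorded even
when the message is not valid UTF-8, at the cost of showing a
replacement character where the bad byte was.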
The link to my.ya.ru does not properly escape the title of the page,
which causes some Russian text to be parsed as if it were HTML
attributes.
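As a rough sketch of the escaping that avoids this on the page side: a
page title placed inside a URL should be percent-encoded, and anything
placed inside an HTML attribute should have its quotes and ampersands
escaped. The share_link() function, the URL and the parameter name
below are hypothetical and do not reflect the real my.ya.ru link format
or how the site generates its pages.

```python
from html import escape
from urllib.parse import quote


def share_link(base_url, page_title):
    """Build an <a> tag whose href and title attributes are safely escaped."""
    # Percent-encode the title so it is safe inside the query string.
    href = base_url + "?title=" + quote(page_title, safe="")
    # Escape &, <, > and quotes so the values cannot end the attribute early.
    return '<a href="%s" title="%s">share</a>' % (
        escape(href, quote=True),
        escape(page_title, quote=True),
    )


# Hypothetical title containing quotes and an ampersand.
title = 'Новости "Свобода" & другие'
link = share_link("http://example.invalid/share", title)
```

Without the escaping, the first double quote in the title would close
the attribute and the remaining words would be read as attribute names,
which is the kind of markup tidy complains about.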
I've fixed the tidy issue in the development version:
http://arthurdejong.org/viewvc/webcheck?revision=460&view=revision
The change should also be usable for the 1.10.4 version.
Thanks,
--
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --
--