lists.arthurdejong.org

RE: Warning while parsing



Many thanks Arthur! I really appreciate it.
Have a nice day,
Jarda


================== 
Jaroslav Lhoták 
Internet Project Manager - Internet Technology
Radio Free Europe / Radio Liberty Inc.
phone +420-2-2112-2031 
http://www.rferl.org/

-----Original Message-----
From: Arthur de Jong [arthur [at] arthurdejong.org] 
Sent: Tuesday, November 08, 2011 11:01 PM
To: Jaroslav Lhotak
Cc: webcheck-users@lists.arthurdejong.org
Subject: Re: Warning while parsing

On Tue, 2011-11-08 at 16:03 +0100, Jaroslav Lhotak wrote: 
> Firstly, I’d like to thank you for your tool. I have started using it 
> experimentally on our sites (www.svobodanews.ru)

Thanks.

> And here’s what I’m getting pretty often. Is this something that 
> should concern me: a problem with the page, or with the way you parse it?
[...] 
> webcheck: Warning: problem parsing page: 'ascii' codec can't decode 
> byte 0xc3 in position 44: ordinal not in range(128)
> Traceback (most recent call last):
>   File "/usr/share/webcheck/crawler.py", line 549, in fetch
>     parsermodule.parse(content, self)
>   File "/usr/share/webcheck/parsers/html/__init__.py", line 121, in parse
>     calltidy.parse(content, link)
>   File "/usr/share/webcheck/parsers/html/calltidy.py", line 35, in parse
>     link.add_pageproblem(parsers.html.htmlunescape(unicode(err)))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 
> 44: ordinal not in range(128)

The bug here is that the tidy plugin generates an error message that cannot be 
converted to Unicode with the default ASCII codec, probably because it contains 
accented characters. The error message is related to the part of the page that 
has the links to Facebook, Twitter, etc.
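
To illustrate the failure mode in the traceback above, here is a minimal
sketch in Python 3 (webcheck itself ran on Python 2, where unicode(err)
falls back to the ASCII codec). The message text and the to_text() helper are
hypothetical, the real tidy message is not shown in this thread, and this is
not necessarily what the fix in the development version does.

```python
# Hypothetical tidy-style message containing an accented character.
# "é" is encoded in UTF-8 as the two bytes 0xc3 0xa9.
raw_message = "Warning: unexpected attribute value café".encode("utf-8")


def to_text(message: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for undecodable bytes.

    This is one tolerant approach (an assumption for illustration): it never
    raises, so a problem report is still recorded even for odd input.
    """
    return message.decode("utf-8", errors="replace")


# Decoding with the ASCII codec fails on the first byte >= 0x80.
try:
    raw_message.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# The tolerant decode succeeds and preserves the accented character.
text = to_text(raw_message)
```

With errors="replace", bytes that are not valid UTF-8 become U+FFFD instead
of aborting the parse, which seems a reasonable trade-off for a diagnostic
message.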

The link to my.ya.ru does not properly escape the title of the page, which 
causes some Russian text to be parsed as if it were HTML attributes.
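
As a rough sketch of how a page could avoid this, the title can be
percent-encoded before it goes into the link's query string, and the whole
URL HTML-escaped before it goes into the href attribute. The share_link()
helper, the example.org address and the url/title parameter names are
assumptions for illustration only; the actual my.ya.ru link format is not
shown in this thread.

```python
from html import escape
from urllib.parse import quote


def share_link(base_url: str, page_url: str, title: str) -> str:
    """Build a share link whose href cannot break out of its attribute.

    Hypothetical helper: the parameter names "url" and "title" are assumed.
    """
    # Percent-encode each value so quotes, spaces and non-ASCII text are
    # safe inside the query string.
    href = f"{base_url}?url={quote(page_url, safe='')}&title={quote(title, safe='')}"
    # HTML-escape the finished URL (including "&") for the attribute value.
    return f'<a href="{escape(href, quote=True)}">share</a>'


link = share_link(
    "http://example.org/share",            # placeholder, not the real service
    "http://www.svobodanews.ru/",
    'Радио "Свобода"',                     # title with quotes and Cyrillic text
)
```

Because the quotes in the title are encoded as %22, the only double quotes
left in the markup are the two that delimit the href value.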

I've fixed the tidy issue in the development version:
  http://arthurdejong.org/viewvc/webcheck?revision=460&view=revision
The change should also be usable for the 1.10.4 version.

Thanks,
--
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org --
