Warning while parsing



Hi,

Firstly, I’d like to thank you for your tool. I have started using it experimentally on our sites (www.svobodanews.ru).

Here is what I get quite often. Is this something that should concern me: is it a problem with the page, or with the way you parse it?

Please find attached the gzipped webcheck.dat and the command line I ran.

Many thanks,

Jarda

root@linux:/home/testing/webcheck# ./webcheck.sh
webcheck: checking site....
webcheck:   getting robots.txt for http://www.svobodanews.ru
webcheck:   http://www.svobodanews.ru/
webcheck:   http://www.svobodanews.ru/js__ver2_6.0.0.19935.1/init.jsx
webcheck:   http://www.svobodanews.ru/howtolisten/waves.html
webcheck:   http://www.svobodanews.ru/video/27303.html
webcheck:   http://www.svobodanews.ru/content/article/24383999.html
webcheck: Warning: problem parsing page: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128)
Traceback (most recent call last):
  File "/usr/share/webcheck/crawler.py", line 549, in fetch
    parsermodule.parse(content, self)
  File "/usr/share/webcheck/parsers/html/__init__.py", line 121, in parse
    calltidy.parse(content, link)
  File "/usr/share/webcheck/parsers/html/calltidy.py", line 35, in parse
    link.add_pageproblem(parsers.html.htmlunescape(unicode(err)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128)
webcheck:   http://www.svobodanews.ru/video/6865a8b3-ff18-4ebd-9e45-eba1c8db44ed.jpgx?w=113&h=64
webcheck:   http://www.svobodanews.ru/img/networking/bw_ybkm.gif
webcheck:   http://www.svobodanews.ru/video/2160905.html?isArticle=1
webcheck:   http://www.svobodanews.ru/jssettings__ver2_6.0.0.19935.1/default.jsx?c=1
webcheck:   http://www.svobodanews.ru/content/article/24382127.html
^Z
[14]+  Stopped                 ./webcheck.sh
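
For reference, the traceback looks like an implicit ASCII decode: in Python 2, unicode(err) uses the ASCII codec by default, so any message containing a non-ASCII byte such as 0xc3 raises this error. Below is a minimal Python 3 sketch of the same failure and of a tolerant decode. It assumes the message is UTF-8 encoded; raw_message and to_text are hypothetical names for illustration, not part of webcheck, and the message text is invented because the real one is not in the log.

```python
# Hypothetical tidy-style message as raw bytes (the real text is not in the log).
# "é" encodes to the UTF-8 bytes 0xc3 0xa9, so the first offending byte is 0xc3.
raw_message = "line 1 column 1 - Warning: invalid character é".encode("utf-8")


def to_text(data, encoding="utf-8"):
    """Decode bytes to text, replacing undecodable bytes instead of raising.

    Assumption: the message is UTF-8; this helper is illustrative only.
    """
    if isinstance(data, str):
        return data
    return data.decode(encoding, errors="replace")


# Decoding as ASCII fails in the same way as in the traceback.
try:
    raw_message.decode("ascii")
    ascii_failed = False
    failure_reason = None
except UnicodeDecodeError as exc:
    ascii_failed = True
    failure_reason = exc.reason

# Decoding with an explicit encoding and errors="replace" never raises.
decoded = to_text(raw_message)
```

With errors="replace", bytes that are not valid in the chosen encoding become U+FFFD rather than aborting the parse of the page, at the cost of losing the original byte values.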

 

 

==================

Jaroslav Lhoták

Internet Project Manager - Internet Technology

Radio Free Europe / Radio Liberty Inc.

phone +420-2-2112-2031

http://www.rferl.org/

Attachment: webcheck.dat.gz
Description: Binary data

Attachment: webcheck.sh.gz
Description: Binary data
