Example for HTML encoding patch
- From: Devin Bayer <l [at] t-0.be>
- To: webcheck-users <webcheck-users [at] lists.arthurdejong.org>
- Subject: Example for HTML encoding patch
- Date: Wed, 9 Nov 2011 13:32:49 +0100
Hi Arthur,
Here is an example of why the HTML parser encoding patch is needed:
$ webcheck/run.py --ignore-robots -d -o html/tmp2-20111109-1329 -l0 http://trac.bewelcome.org/
webcheck: INFO: checking site....
webcheck: INFO: http://trac.bewelcome.org/
webcheck: DEBUG: crawler.Link.set_encoding('utf-8')
webcheck: DEBUG: parsing using webcheck.parsers.html
webcheck: WARNING: falling back to the legacy HTML parser, consider installing BeautifulSoup
webcheck: ERROR: caught exception: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/dev/linkcheck/webcheck/webcheck/parsers/html/htmlparser.py", line 274, in parse
    parser.feed(content)
  File "/usr/lib64/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib64/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.6/HTMLParser.py", line 249, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib64/python2.6/HTMLParser.py", line 387, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib64/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
webcheck: DEBUG: html encoding: utf-8
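
A likely reading of the traceback is that the page content reaches the parser as raw bytes, and unescape() then tries to combine a non-ASCII attribute value (starting with byte 0xe2) with a decoded entity, which forces an implicit ASCII decode. The general remedy is to decode the content with the detected encoding before feeding it to the parser. Below is a minimal sketch of that idea, written for Python 3's html.parser rather than the Python 2.6 HTMLParser shown above. The names LinkCollector and parse_links are hypothetical and are not taken from webcheck or from the patch itself.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href attribute values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def parse_links(content, encoding="utf-8"):
    """Decode raw bytes to text first, then feed the parser."""
    if isinstance(content, bytes):
        # Assumption: the encoding was detected earlier (utf-8 in the log).
        # errors="replace" keeps parsing going on malformed input.
        content = content.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(content)
    parser.close()
    return parser.links


# The attribute value starts with byte 0xe2 and also contains an entity,
# the same combination that appears to trigger the error above.
raw = b'<a href="\xe2\x86\x92?a=1&amp;b=2">next</a>'
links = parse_links(raw)
```

Because the parser only ever sees text, entity unescaping in attribute values never has to mix bytes with decoded characters.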
~ devin