lists.arthurdejong.org

Example for HTML encoding patch

Hi Arthur,

Here is an example of why the HTML parser encoding patch is needed:

$ webcheck/run.py --ignore-robots -d -o html/tmp2-20111109-1329 -l0 http://trac.bewelcome.org/
webcheck: INFO: checking site....
webcheck: INFO: http://trac.bewelcome.org/
webcheck: DEBUG: crawler.Link.set_encoding('utf-8')
webcheck: DEBUG: parsing using webcheck.parsers.html
webcheck: WARNING: falling back to the legacy HTML parser, consider installing BeautifulSoup
webcheck: ERROR: caught exception: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/dev/linkcheck/webcheck/webcheck/parsers/html/htmlparser.py", line 274, in parse
    parser.feed(content)
  File "/usr/lib64/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib64/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.6/HTMLParser.py", line 249, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib64/python2.6/HTMLParser.py", line 387, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib64/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
webcheck: DEBUG: html encoding: utf-8
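
The traceback suggests that undecoded bytes reached the parser: byte 0xe2 is the first byte of a multi-byte UTF-8 sequence, and the failure occurs while an attribute value is being unescaped. A minimal sketch of the general idea (decode the content with the known encoding before feeding it to the parser) is below. It is written for Python 3's html.parser rather than the Python 2.6 HTMLParser module in the traceback, and it is not the patch itself; LinkCollector, parse_links and the sample markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                self.links.append(value)


def parse_links(raw, encoding="utf-8"):
    """Decode raw bytes to text first, then feed the text to the parser.

    Assumption: the encoding is already known (the log above reports
    utf-8). Undecodable bytes are replaced instead of raising.
    """
    text = raw.decode(encoding, "replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Made-up sample: an attribute value containing a UTF-8 encoded character
# (bytes e2 80 99) followed by an entity that the parser must unescape.
raw = b'<a href="/wiki/\xe2\x80\x99&amp;x">link</a>'

# Treating these bytes as ASCII fails, as in the traceback.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii decode failed:", exc)

# Decoding with the page's encoding first lets parsing proceed.
print(parse_links(raw))
```

With the content decoded up front, the entity unescaping step only ever sees text, so no implicit ASCII conversion is needed.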

~ devin