webcheck commit: r465 - webcheck/webcheck/parsers/html
- From: Commits of the webcheck project <webcheck-commits [at] lists.arthurdejong.org>
- To: webcheck-commits [at] lists.arthurdejong.org
- Reply-to: webcheck-users [at] lists.arthurdejong.org
- Subject: webcheck commit: r465 - webcheck/webcheck/parsers/html
- Date: Wed, 16 Nov 2011 12:19:41 +0100 (CET)
Author: devin
Date: Wed Nov 16 12:19:40 2011
New Revision: 465
URL: http://arthurdejong.org/viewvc/webcheck?revision=465&view=revision
Log:
in old html parser, handle more invalid encodings
Modified:
webcheck/webcheck/parsers/html/htmlparser.py
Modified: webcheck/webcheck/parsers/html/htmlparser.py
==============================================================================
--- webcheck/webcheck/parsers/html/htmlparser.py Wed Nov 16 12:19:15 2011 (r464)
+++ webcheck/webcheck/parsers/html/htmlparser.py Wed Nov 16 12:19:40 2011 (r465)
@@ -258,12 +258,11 @@
     # try to decode with the given encoding
     if encoding:
         try:
-            return htmlunescape(unicode(txt, encoding, 'replace'))
+            return htmlunescape(txt.decode(encoding))
         except (LookupError, TypeError, ValueError), e:
             logger.warn('page has unknown encoding: %s', str(encoding))
     # fall back to locale's encoding
-    return htmlunescape(unicode(txt, errors='replace'))
-
+    return htmlunescape(txt.decode('ascii', 'replace'))
 
 def parse(content, link):
     """Parse the specified content and extract an url list, a list of images a
@@ -271,7 +270,7 @@
     # create parser and feed it the content
     parser = _MyHTMLParser(link)
     try:
-        parser.feed(content)
+        parser.feed(content.decode('ascii', 'ignore').encode())
         parser.close()
     except Exception, e:
         # ignore (but log) all errors
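
The decoding strategy in this change can be illustrated outside the parser. The following is a minimal sketch, not the webcheck code itself: it is written for Python 3 (the commit targets Python 2), uses the standard library's html.unescape in place of webcheck's htmlunescape, and the names decode_text and strip_non_ascii are hypothetical. It tries a strict decode with the declared encoding, logs and falls back to a lossy ASCII decode if that fails, and separately shows the byte filtering applied before feeding the parser.

```python
import html
import logging

logger = logging.getLogger(__name__)


def decode_text(raw, encoding=None):
    """Decode bytes to text (hypothetical helper, Python 3 sketch)."""
    if encoding:
        try:
            # strict decode: raises on an unknown codec or on invalid bytes
            return html.unescape(raw.decode(encoding))
        except (LookupError, TypeError, ValueError):
            # UnicodeDecodeError is a subclass of ValueError
            logger.warning('page has unknown encoding: %s', encoding)
    # fallback: keep ASCII bytes, substitute U+FFFD for everything else
    return html.unescape(raw.decode('ascii', 'replace'))


def strip_non_ascii(content):
    """Drop every non-ASCII byte, mirroring the parser.feed() change."""
    return content.decode('ascii', 'ignore').encode('ascii')
```

Note that with a strict decode, a page whose declared encoding is valid but whose bytes are not takes the fallback path as well, so non-ASCII characters in such a page are replaced rather than preserved.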