webcheck commit: r465 - webcheck/webcheck/parsers/html

Author: devin
Date: Wed Nov 16 12:19:40 2011
New Revision: 465
URL: http://arthurdejong.org/viewvc/webcheck?revision=465&view=revision

Log:
in old html parser, handle more invalid encodings

Modified:
   webcheck/webcheck/parsers/html/htmlparser.py

Modified: webcheck/webcheck/parsers/html/htmlparser.py
==============================================================================
--- webcheck/webcheck/parsers/html/htmlparser.py        Wed Nov 16 12:19:15 2011        (r464)
+++ webcheck/webcheck/parsers/html/htmlparser.py        Wed Nov 16 12:19:40 2011        (r465)
@@ -258,12 +258,11 @@
     # try to decode with the given encoding
     if encoding:
         try:
-            return htmlunescape(unicode(txt, encoding, 'replace'))
+            return htmlunescape(txt.decode(encoding))
         except (LookupError, TypeError, ValueError), e:
             logger.warn('page has unknown encoding: %s', str(encoding))
     # fall back to locale's encoding
-    return htmlunescape(unicode(txt, errors='replace'))
-
+    return htmlunescape(txt.decode('ascii', 'replace'))
 
 def parse(content, link):
     """Parse the specified content and extract an url list, a list of images a
@@ -271,7 +270,7 @@
     # create parser and feed it the content
     parser = _MyHTMLParser(link)
     try:
-        parser.feed(content)
+        parser.feed(content.decode('ascii', 'ignore').encode())
         parser.close()
     except Exception, e:
         # ignore (but log) all errors
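
Note on the behaviour change: the first hunk now decodes strictly with the
declared encoding, so invalid byte sequences raise an error (caught as
ValueError) and fall through to an ASCII decode with replacement characters.
The second hunk drops non-ASCII bytes before feeding the parser. The following
is a minimal sketch of that pattern in Python 3 syntax, not the project's code
(the commit itself is Python 2). The names decode_page_text and ascii_only are
hypothetical, and the htmlunescape step from the diff is left out.

```python
import logging

logger = logging.getLogger(__name__)


def decode_page_text(raw, encoding=None):
    """Decode raw bytes, preferring the declared encoding (hypothetical helper)."""
    if encoding:
        try:
            # strict decode: an unknown codec raises LookupError, invalid
            # bytes raise UnicodeDecodeError (a ValueError subclass)
            return raw.decode(encoding)
        except (LookupError, TypeError, ValueError):
            logger.warning('page has unknown encoding: %s', encoding)
    # fall back to ASCII, substituting U+FFFD for every undecodable byte
    return raw.decode('ascii', 'replace')


def ascii_only(raw):
    """Drop every non-ASCII byte, as the second hunk does before parser.feed()."""
    return raw.decode('ascii', 'ignore').encode('ascii')
```

With this sketch, bytes that are invalid in the declared encoding no longer
come back partially decoded with replacement characters from that encoding.
They are decoded as ASCII instead, so each byte above 0x7F becomes U+FFFD.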