webcheck commit: r465 - webcheck/webcheck/parsers/html


From: Commits of the webcheck project <webcheck-commits [at] lists.arthurdejong.org>
To: webcheck-commits [at] lists.arthurdejong.org
Reply-to: webcheck-users [at] lists.arthurdejong.org
Subject: webcheck commit: r465 - webcheck/webcheck/parsers/html
Date: Wed, 16 Nov 2011 12:19:41 +0100 (CET)

Author: devin
Date: Wed Nov 16 12:19:40 2011
New Revision: 465
URL: http://arthurdejong.org/viewvc/webcheck?revision=465&view=revision

Log:
in old html parser, handle more invalid encodings

Modified:
   webcheck/webcheck/parsers/html/htmlparser.py

Modified: webcheck/webcheck/parsers/html/htmlparser.py
==============================================================================
--- webcheck/webcheck/parsers/html/htmlparser.py        Wed Nov 16 12:19:15 2011        (r464)
+++ webcheck/webcheck/parsers/html/htmlparser.py        Wed Nov 16 12:19:40 2011        (r465)
@@ -258,12 +258,11 @@
     # try to decode with the given encoding
     if encoding:
         try:
-            return htmlunescape(unicode(txt, encoding, 'replace'))
+            return htmlunescape(txt.decode(encoding))
         except (LookupError, TypeError, ValueError), e:
             logger.warn('page has unknown encoding: %s', str(encoding))
     # fall back to locale's encoding
-    return htmlunescape(unicode(txt, errors='replace'))
-
+    return htmlunescape(txt.decode('ascii', 'replace'))
 
 def parse(content, link):
     """Parse the specified content and extract an url list, a list of images a
@@ -271,7 +270,7 @@
     # create parser and feed it the content
     parser = _MyHTMLParser(link)
     try:
-        parser.feed(content)
+        parser.feed(content.decode('ascii', 'ignore').encode())
         parser.close()
     except Exception, e:
         # ignore (but log) all errors
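
For readers who want to try the decoding behaviour this change appears to introduce, here is a minimal Python 3 sketch (the committed code is Python 2). It is not the webcheck implementation. The names decode_text and ascii_only are hypothetical, and the standard library's html.unescape stands in for webcheck's own htmlunescape helper. As the diff reads, a strict decode with the declared encoding is tried first, and any failure (unknown codec or undecodable bytes) falls through to an ASCII decode with replacement characters.

```python
import html
import logging

logger = logging.getLogger(__name__)


def decode_text(raw, encoding=None):
    """Decode bytes with the declared encoding, else fall back to ASCII.

    Hypothetical helper; html.unescape is an assumed stand-in for
    webcheck's htmlunescape.
    """
    if encoding:
        try:
            # strict decode: undecodable bytes raise UnicodeDecodeError,
            # which is a subclass of ValueError
            return html.unescape(raw.decode(encoding))
        except (LookupError, TypeError, ValueError):
            logger.warning('page has unknown encoding: %s', encoding)
    # fallback: every non-ASCII byte becomes U+FFFD
    return html.unescape(raw.decode('ascii', 'replace'))


def ascii_only(raw):
    """Drop every non-ASCII byte, as the new parser.feed() argument does."""
    return raw.decode('ascii', 'ignore').encode('ascii')
```

With this sketch, decode_text(b'caf\xe9', 'utf-8') falls back and yields 'caf' followed by one U+FFFD, while ascii_only(b'caf\xc3\xa9') yields b'caf'. The trade-off is that the fallback paths lose non-ASCII characters rather than raising an error.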