webcheck branch master updated. 1.10.4-81-g7e158dc
- From: Commits of the webcheck project <webcheck-commits [at] lists.arthurdejong.org>
- To: webcheck-commits [at] lists.arthurdejong.org
- Reply-to: webcheck-users [at] lists.arthurdejong.org
- Subject: webcheck branch master updated. 1.10.4-81-g7e158dc
- Date: Mon, 25 Nov 2013 23:42:20 +0100 (CET)
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "webcheck".
The branch, master has been updated
via 7e158dc6ddec72a605e5f65e4317d922c442110e (commit)
via 1aa030d50308b1ef23331c420e67915a6a9d18b7 (commit)
from c31be0302aa5ee0ab496f017edfcbf6a4bd3cc92 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://arthurdejong.org/git/webcheck/commit/?id=7e158dc6ddec72a605e5f65e4317d922c442110e
commit 7e158dc6ddec72a605e5f65e4317d922c442110e
Author: Arthur de Jong <arthur@arthurdejong.org>
Date: Mon Nov 25 19:08:47 2013 +0100
Update documentation
This updates the README, HACKING and other documentation to be more in
line with the current software set-up. This also updates the TODO list
with current changes.
diff --git a/HACKING b/HACKING
index ac0539d..ea11b40 100644
--- a/HACKING
+++ b/HACKING
@@ -1,28 +1,59 @@
+
+This document tries to describe the software layout and design of
+webcheck. It should provide some help for contributing code to this package.
+
+
+CONTRIBUTING TO WEBCHECK
+========================
+
+Contributions to webcheck are most welcome. Integrating contributions will
+be done on a best-effort basis and can be made easier if the following are
+considered:
+
+* for large changes it is a good idea to send an email first
+* send your patches in unified diff (diff -u) format, Git patches or Git
+ pull requests
+* try to use the svn version of the software to develop the patch
+* clearly state which problem you're trying to solve and how this is
+ accomplished
+* please follow the existing coding conventions
+* please test the patch and include information on testing with the patch
+* add a copyright statement with the patch if you feel the contribution is
+ significant enough (e.g. more than a few lines)
+* when including third-party code, retain copyright information (copyright
+ holder and license) and ensure that the license is GPL compatible
+
+Please email webcheck-users@lists.arthurdejong.org if you want to
+contribute. All contributions will be acknowledged in the AUTHORS file.
+
+
WEBCHECK DESIGN OVERVIEW
========================
-Webcheck has grown and has been refactored over time so there is not really a
-single design. The functions are grouped in modules according to their
-function. This graphs should present a simple overview of the modules and
-order of calling the functions.
+Webcheck has grown and has been refactored over time. While some different
+design concepts were used, recently there has been a push towards a modular
+plugin-based design.
+
+The graphs below should give an overview of the modules and order of calling
+the functions.
-webcheck/ - top-level namespace
- \- cmd.py - main program entry point, command line parsing, etc
- \- config.py - configuration settings (imported from most other
- | modules)
- \- util.py - common functions imported from most other modules
+webcheck - top-level namespace
+ \- cmd - command-line front-end for webcheck
+ \- config - configuration settings (imported from most other
+ | modules, expected to be refactored out)
+ \- crawler - home of the Crawler class that controls the
+ | initialisation, crawling, post-processing and
+ | report generation
+ \- db - database definitions using SQLAlchemy
+ | used to persist the crawled data in a SQLite db
+ \- monkeypatch - hacks to fix third-party bugs
+ \- myurllib - URL normalisation functions
+ \- output - utility functions for report generation
|
- \- crawler.py - module with loop and logic for traversing a
- | | website and storing all the information about
- | | the website that is used later
- \- myurllib.py - module for ftp/file/http url fetching
+ \- parsers - entry point for content parsing
+ | \- html - parser modules for HTML content
+ | \- css - parser module for CSS
|
- \- parsers/__init__.py - front-end module to handle parsing of content
- | \- html/ - parser modules for html content
- | \- css.py - parser module for css (dummy currently)
+ \- plugins - collection of report and post-processing plugins
|
- \- plugins/__init__.py - front-end module for plugin modules, this calls
- | all configured plugins and has some helper
- | functions for plugins
- \- plugins/*.py - per report one plugin that does some specific
- checking and outputs some html code
+ \- templates - HTML templates for report generation
diff --git a/NEWS b/NEWS
index cf4dee8..60dae1f 100644
--- a/NEWS
+++ b/NEWS
@@ -2,7 +2,7 @@ changes from 1.10.4 to 1.10.5 (alpha)
-----------------------------
* added setup.py for pypi/egg-based installation
-* support --levels option to control max depth
+* support --max-depth option to control max depth
* detect and report on endless redirects
* move to sqlite for storing crawler state
diff --git a/README b/README
index 14bd368..adab9a6 100644
--- a/README
+++ b/README
@@ -11,7 +11,7 @@
Copyright (C) 1998, 1999 Albert Hopkins (marduk)
Copyright (C) 2002 Mike W. Meyer
- Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010 Arthur de Jong
+ Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 Arthur de Jong
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -55,15 +55,15 @@ Features of webcheck include:
* list links pointing to external sites
* can run without user intervention
-webcheck is written in Python and is developed on a Debian system with Python
-2.4. Previous versions of Python are not tested regularly. More recent
-versions of Python have much better performance for large sites. Patches to
-support a wider range of Python releases are welcome (provided they are not
-too intrusive).
+webcheck is written in Python and is developed on a Debian system with
+Python 2.7. Previous versions of Python are not tested regularly. Patches
+to support a wider range of Python releases are welcome (provided they are
+not too intrusive).
INSTALLING WEBCHECK
===================
+
This will install the latest version from PyPi.
% easy_install webcheck
@@ -71,6 +71,7 @@ This will install the latest version from PyPi.
MANUAL INSTALLATION
===================
+
Installation is relatively easy. These installation instructions are for
Unix-like systems. Other operating systems may differ.
@@ -101,7 +102,7 @@ Should crawl the site and write the reports to the /tmp/myreport directory.
The reports are simple HTML pages that should look fine with most modern
browsers.
-For more information on webcheck usage and command line options see the
+For more information on webcheck usage and command line options, see the
webcheck manual page. If the manual page is not in the MANPATH you can
probably open the manual with something like:
% man -l /opt/webcheck-1.10.4/webcheck.1
diff --git a/TODO b/TODO
index 1e5dee7..2d0664e 100644
--- a/TODO
+++ b/TODO
@@ -2,16 +2,14 @@ before next release
-------------------
* go over all FIXMEs in code (ftp)
* follow redirects (to a limit) of external sites
-* -U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
+* -U, --user-agent=AGENT identify as AGENT instead of webcheck VERSION
-probably before 2.0 release
+probably before 3.0 release
---------------------------
* support for multi-threading (use -t, --threads as option)
-* find a fix for redirecting stdout and stderr to work properly
* implement a maximum transfer size for downloading
* support ftp proxies
* support proxying https traffic
-* give problems different levels (info, warning, error) or categories
* option to only force overwrite generated files and leave static files (css,
js) alone
* implement a --html-only option to not copy css and other files
* check for missing encoding (report problem)
@@ -21,48 +19,30 @@ probably before 2.0 release
wishlist
--------
* make code for stripping last part of a url (e.g. foo/index.html -> foo/)
-* maybe set referer (configurable)
-* cookies support (maybe) (not difficult with urllib2)
* integration with weblint
* do form checking of crawled pages
* do spelling checking of crawled pages
-* test w3c conformance of pages
* add support for fetching gzipped content to improve performance
-* maybe do http pipelining
* maybe output a google sitemap file:
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
* maybe trim titles that are too long
* maybe check that documents referenced in <img> tags are really images
-* maybe split out plugins in check() and generate() functions
-* make FAQ
* use gettext to present output to enable translations of messages and html
* maybe report on embedded content that is external
* present an overview of problem pages: "100 problems in 10 pages" (per author)
* check of email addresses that they are formatted properly and check that
host part has an MX record (make it a problem for no record or only an A record)
-* maybe implement news, nntp, gopher and telnet schemes (if there is anyone
that wants them)
-* maybe add custom bullets in problem lists, depending on problem type
* present age for times long ago in a friendlier format (.. days ago, ..
months ago, .. years ago)
* maybe unescaped spaces aren't always a real problem (e.g. in mailto: urls)
* maybe give a warning for urls that have non-ascii characters
* maybe fetch and store description and other meta information about page
(keywords) (just like author)
-* connect to w3c-markup-validator and tidy (and possibly other tools)
-* find out why title does not show up correctly for file?:// urls if they
contain non-ascii chars
* output scan took so long
-* support unicode strings for all string values in link objects (url, status,
mimetype, encoding, etc)
-* maybe also serialize robotparsers
* maybe also add robots.txt to urllist if fetched successfully
* support CSS encoding:
http://www.w3.org/International/questions/qa-css-charset
* webcheck does not give an error when accessing http://site:443/ ??
-* improve data structures (e.g. see if pop() is faster than pop(0))
-* do not use string for serializing child, embed, anchor and reqanchor as they
are already url-encoded
-* there seem to be some issues with generating site maps for ftp directories
-* document serialized file format in manual page (if it is stabilized)
* look into python-spf to see how DNS queries are done
* implement an option to ignore problems on pages (but do consider internal,
etc) (e.g. for generated or legacy html)
-* maybe use urllib2 instead of our own custom code (redirects may be a problem
here though)
* add support for robots meta tag: http://www.robotstxt.org/wc/meta-user.html
* only report multiple definitions of a single anchor once
* warn if URL contains unencoded characters
* see section 6 of rfc3986.txt for URL comparison (esp. 6.2.2.)
* implement paging for huge reports
-* check out python-coverage
* output timing information on scan (e.g. scan took 30 minutes)
diff --git a/webcheck.1 b/webcheck.1
index e69faae..1e8eda8 100644
--- a/webcheck.1
+++ b/webcheck.1
@@ -1,4 +1,4 @@
-.\" Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010 Arthur de Jong
+.\" Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 Arthur de Jong
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
@@ -131,13 +131,9 @@ Redirect depth. the number of redirects webcheck should follow when
following a link. 0 implies to follow all redirects.
.TP
-.BI "\-u, \-\-userpass=" "URL"
-Specify a URL with username and password information to use for basic
-authentication when visiting the site.
-.br
-e.g. http://test:secret@example.com/
-.br
-This option may be specified multiple times.
+.BI "\-l, \-\-max-depth=, \-\-levels" "N"
+Recursion depth. The number of links to follow from the base URLs.
+By default links are infinitely followed.
.TP
.BI "\-w, \-\-wait=" "SECONDS"
@@ -161,8 +157,8 @@ Show short summary of options.
URLs are divided into two classes:
.B Internal
-URLs are retrieved and the retrieved item is checked for syntax.
-Also, the retrieved item is searched for links to other items (of any class)
+URLs are retrieved and the retrieved content is checked for problems.
+Also, the retrieved item is searched for links
and these links are followed.
.B External
@@ -217,7 +213,7 @@ Copyright \(co 1998, 1999 Albert Hopkins (marduk)
.br
Copyright \(co 2002 Mike W. Meyer
.br
-Copyright \(co 2005, 2006, 2007, 2008, 2009, 2010 Arthur de Jong
+Copyright \(co 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 Arthur de Jong
.br
webcheck is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
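The manual page change above documents --max-depth as the number of links
to follow from the base URLs, with no limit by default. As a rough
illustration of what such a limit means, here is a minimal sketch of a
depth-limited breadth-first traversal. It is not webcheck's crawler code:
the crawl() function and the in-memory "links" dictionary (standing in for
fetching a page and extracting its links) are hypothetical, and whether
webcheck counts depth in exactly this way is an assumption.

```python
from collections import deque


def crawl(links, base_urls, max_depth=None):
    """Return {url: depth} for every URL reachable within max_depth links.

    `links` is a hypothetical stand-in for fetching and parsing pages:
    a dict mapping a URL to the list of URLs that page links to.
    max_depth=None means no limit, mirroring the documented default.
    """
    depth_of = dict((url, 0) for url in base_urls)
    queue = deque(base_urls)
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if max_depth is not None and depth >= max_depth:
            continue  # page is at the limit: record it, but follow nothing
        for target in links.get(url, ()):
            if target not in depth_of:  # visit each URL once, so loops end
                depth_of[target] = depth + 1
                queue.append(target)
    return depth_of


# A tiny three-page site with a link loop (illustrative data only).
site = {
    'http://example.com/': ['http://example.com/a'],
    'http://example.com/a': ['http://example.com/b'],
    'http://example.com/b': ['http://example.com/'],
}

limited = crawl(site, ['http://example.com/'], max_depth=1)
unlimited = crawl(site, ['http://example.com/'])
```

With max_depth=1 only the base URL and the page it links to directly are
recorded; without a limit the traversal stops once every reachable URL has
been seen.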
http://arthurdejong.org/git/webcheck/commit/?id=1aa030d50308b1ef23331c420e67915a6a9d18b7
commit 1aa030d50308b1ef23331c420e67915a6a9d18b7
Author: Arthur de Jong <arthur@arthurdejong.org>
Date: Mon Nov 18 19:07:20 2013 +0100
Support older versions of Jinja
This tries to gracefully support older versions of Jinja that don't
provide the trim_blocks, lstrip_blocks or keep_trailing_newline options.
diff --git a/webcheck/output.py b/webcheck/output.py
index 6811338..a163156 100644
--- a/webcheck/output.py
+++ b/webcheck/output.py
@@ -123,8 +123,11 @@ def install_file(source, is_text=False):
env = jinja2.Environment(
loader=jinja2.PackageLoader('webcheck'),
extensions=['jinja2.ext.autoescape'],
- autoescape=True,
- trim_blocks=True, lstrip_blocks=True, keep_trailing_newline=True)
+ autoescape=True)
+# set options that are not supported in older versions of jinja2
+env.trim_blocks = True
+env.lstrip_blocks = True
+env.keep_trailing_newline = True
def render(output_file, **kwargs):
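The reasoning behind the change above can be shown without Jinja itself:
a constructor rejects keyword arguments it does not know, while assigning
an attribute to an existing object always succeeds. The sketch below uses
a hypothetical OldEnvironment class as a stand-in for an older
jinja2.Environment; it is an assumption made for illustration and not
Jinja's real implementation.

```python
class OldEnvironment(object):
    """Hypothetical stand-in for an older environment class whose
    constructor does not accept the newer options."""

    def __init__(self, autoescape=False):
        self.autoescape = autoescape


def make_environment(cls):
    """Create an environment in the style of the commit above."""
    # Pass only the arguments that every version is assumed to accept.
    env = cls(autoescape=True)
    # Set the newer options as plain attributes. A version that does not
    # know an option simply never reads the attribute.
    env.trim_blocks = True
    env.lstrip_blocks = True
    env.keep_trailing_newline = True
    return env


# Passing a newer option to the old constructor raises TypeError.
try:
    OldEnvironment(autoescape=True, lstrip_blocks=True)
    constructor_failed = False
except TypeError:
    constructor_failed = True

# Setting the attributes after construction works with the same class.
env = make_environment(OldEnvironment)
```

The trade-off is that an unsupported option is silently ignored instead of
being reported, which matches the stated goal of degrading gracefully.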
-----------------------------------------------------------------------
Summary of changes:
HACKING | 73 +++++++++++++++++++++++++++++++++++++---------------
NEWS | 2 +-
README | 15 ++++++-----
TODO | 24 ++---------------
webcheck.1 | 18 +++++--------
webcheck/output.py | 7 +++--
6 files changed, 75 insertions(+), 64 deletions(-)
hooks/post-receive
--
webcheck
--
To unsubscribe send an email to
webcheck-commits-unsubscribe@lists.arthurdejong.org or see
http://lists.arthurdejong.org/webcheck-commits/