Re: meta data checking plugin for webcheck


From: Arthur de Jong <arthur [at] arthurdejong.org>
To: webcheck-users [at] lists.arthurdejong.org
Subject: Re: meta data checking plugin for webcheck
Date: Fri, 10 Jan 2014 22:53:51 +0100

On Fri, 2014-01-10 at 15:47 +0100, Fabien Quatravaux wrote:
> I discovered webcheck today and I found it very useful.
> I would like to improve it and develop a plugin to check for meta data
> (meta tags and some schema.org tags), but I can't find any
> documentation about how to write a plugin. Specifically, I would like
> to know if the python parser has already extracted the meta tags, and
> where I can find them.

If you want to add functionality to webcheck, I recommend you use the
Git version. The plugin structure is somewhat simpler in that version.

The content parsing code is still mostly hard-coded though (I have some
ideas about also using the plugins for that).

The best place to do this extra parsing for now is probably to add an
extra call at the end of
  webcheck.parsers.html.beautifulsoup
and pass it the soup variable, which can be queried for the HTML
structure.
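
As a rough sketch of what such an extra call could do with the soup, the
function below collects meta tags and schema.org itemprop values into a
dict. It is an illustration only: extract_meta() is a hypothetical name,
it is not part of webcheck, and it assumes a soup object from the bs4
package (the BeautifulSoup version that webcheck actually uses may offer
slightly different method names).

```python
from bs4 import BeautifulSoup


def extract_meta(soup):
    """Return a dict of meta data found in a parsed HTML document.

    Hypothetical helper, not part of webcheck.
    """
    meta = {}
    # <meta name="..." content="..."> and <meta property="..." content="...">
    for tag in soup.find_all('meta'):
        key = tag.get('name') or tag.get('property')
        content = tag.get('content')
        if key and content is not None:
            meta[key.lower()] = content
    # schema.org microdata: any element that carries an itemprop attribute
    for tag in soup.find_all(attrs={'itemprop': True}):
        value = tag.get('content') or tag.get_text(strip=True)
        meta['itemprop:' + tag['itemprop']] = value
    return meta


# Small stand-alone demonstration with a made-up page.
html = """<html><head>
<title>Example</title>
<meta name="Description" content="A test page">
<meta property="og:title" content="Example">
</head><body>
<span itemprop="author">Jane</span>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')
meta = extract_meta(soup)
for key in sorted(meta):
    print(key, '=', meta[key])
```

Inside webcheck the soup would already exist at the end of the parser, so
only the function call would be needed there; where the result gets
stored is a separate question.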

Another thing is that the database schema currently has hard-coded
properties (see webcheck.db). It is probably better to make a
LinkProperty class and use that for title, size, mime type, etc. That
would also be a good place to store other meta data of crawled pages.
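
To make the idea of generic link properties concrete, here is a minimal
sketch of a name/value table keyed by link. It uses plain sqlite3 rather
than webcheck.db, and the table names, column names and the
set_property() and get_properties() helpers are all assumptions made for
illustration, not existing webcheck code.

```python
import sqlite3

# In-memory database; the schema below is hypothetical.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE links (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL
);
CREATE TABLE link_properties (
    link_id INTEGER NOT NULL REFERENCES links(id),
    name TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (link_id, name)
);
""")


def set_property(conn, url, name, value):
    """Store one named property for a link, replacing any older value."""
    conn.execute('INSERT OR IGNORE INTO links (url) VALUES (?)', (url,))
    (link_id,) = conn.execute(
        'SELECT id FROM links WHERE url = ?', (url,)).fetchone()
    conn.execute(
        'INSERT OR REPLACE INTO link_properties (link_id, name, value) '
        'VALUES (?, ?, ?)', (link_id, name, str(value)))


def get_properties(conn, url):
    """Return all stored properties of a link as a dict of strings."""
    rows = conn.execute(
        'SELECT p.name, p.value FROM link_properties p '
        'JOIN links l ON l.id = p.link_id WHERE l.url = ?', (url,))
    return dict(rows)


url = 'http://example.org/'
set_property(conn, url, 'title', 'Example')
set_property(conn, url, 'size', 1234)
set_property(conn, url, 'meta:description', 'A test page')
props = get_properties(conn, url)
```

With a layout like this, title, size and mime type become ordinary rows,
and extra meta data of a crawled page can be added without changing the
schema. The trade-off is that all values are stored as text.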

If you have any code to share, I'm willing to integrate it into
webcheck.

Thanks,

-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org/ --
