Re: meta data checking plugin for webcheck

On Fri, 2014-01-10 at 15:47 +0100, Fabien Quatravaux wrote:
> I discovered webcheck today and I found it very useful.
> I would like to improve it and develop a plugin to check for meta data
> (meta tags and some schema.org tags), but I can't find any
> documentation about how to write a plugin. Specifically, I would like
> to know if the python parser has already extracted the meta tags, and
> where I can find them.

If you want to add functionality to webcheck, I recommend you use the Git
version. The plugin structure is somewhat simpler in that version.

The content parsing code is still mostly hard-coded though (I have some
ideas about also using the plugins for that).

The best place to do this extra parsing for now is probably an extra
call at the end of
  webcheck.parsers.html.beautifulsoup
that is passed the soup variable, which can be queried for the HTML
structure.
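
As a rough sketch of what such an extra call could look like: the
function below is hypothetical (extract_meta is not part of webcheck)
and assumes the soup object offers a bs4-style find_all() method
(BeautifulSoup 3 spells it findAll()). It only collects meta tags that
carry a name or property attribute together with a content attribute.

```python
def extract_meta(soup):
    """Collect <meta> tags from a parsed page (hypothetical helper).

    Assumes `soup` has a bs4-style find_all() and that each tag
    supports .get() for attribute lookup.
    Returns a dict mapping the lower-cased name/property to a list of
    content values, since a page may repeat the same meta name.
    """
    meta = {}
    for tag in soup.find_all('meta'):
        # schema.org / Open Graph style tags tend to use "property"
        # or "itemprop" rather than "name"
        key = tag.get('name') or tag.get('property') or tag.get('itemprop')
        content = tag.get('content')
        if key and content is not None:
            meta.setdefault(key.lower(), []).append(content)
    return meta
```

A plugin could then check the returned dictionary, for example to
report pages that have no "description" entry.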

Another thing is that the database schema currently has hard-coded
properties (see webcheck.db). It is probably better to make a
LinkProperty class and use that for title, size, mime type, etc. That
would also be a good place to store other meta data of crawled pages.
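
To illustrate the idea only: the sketch below uses the standard
sqlite3 module instead of webcheck's actual database layer, and the
table names, column names and helper functions are all made up for
this example. It stores properties as name/value rows per link, so a
new kind of meta data needs no schema change.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical tables, not the real webcheck.db schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT UNIQUE NOT NULL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS link_properties ("
        " link_id INTEGER NOT NULL REFERENCES links(id),"
        " name TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (link_id, name))")


def set_property(conn, url, name, value):
    """Store (or overwrite) one named property for a link."""
    conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
    link_id = conn.execute(
        "SELECT id FROM links WHERE url = ?", (url,)).fetchone()[0]
    conn.execute(
        "INSERT OR REPLACE INTO link_properties (link_id, name, value)"
        " VALUES (?, ?, ?)", (link_id, name, value))


def get_properties(conn, url):
    """Return all stored properties of a link as a dict."""
    rows = conn.execute(
        "SELECT p.name, p.value FROM link_properties p"
        " JOIN links l ON l.id = p.link_id WHERE l.url = ?", (url,))
    return dict(rows.fetchall())
```

With this layout, title, size and mime type become ordinary rows next
to any extra meta data a plugin wants to record.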

If you have any code to share, I'm willing to integrate it into
webcheck.

Thanks,

-- 
-- arthur - arthur@arthurdejong.org - http://arthurdejong.org/ --