lists.arthurdejong.org

Re: Webcheck

On Wed, 2020-03-11 at 13:10 +0000, Sam Williamson wrote:
> I was interested in using your tool webcheck, but wanted to know if
> it would be able to find links to hacked websites on a website. So
> for example, could you crawl a site like Wikipedia and find all
> external links pointing to a website that is now hacked. I've tried
> to build this myself with Python but struggled. You could look for
> keywords on the sites like "casino", "gambling", "porn" and Chinese
> characters to determine if a site has been hacked or not. Anyway, let
> me know what you think and thanks for all your software

Hi Sam,

The webcheck crawler should be reasonably pluggable, so you should be
able to modify it to parse page content. I had an idea to make more of
the parsing pluggable as well, but I never got around to implementing
that.
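
As a rough illustration of the kind of content check described in the
question, here is a minimal, standalone sketch. It does not use
webcheck's API; the function name, the keyword list and the threshold
are assumptions chosen for the example, and a heuristic like this will
also flag legitimate sites (for instance any genuinely Chinese-language
page).

```python
import re

# Hypothetical keyword list, taken from the examples in the question.
SPAM_KEYWORDS = ("casino", "gambling", "porn")

# Basic CJK Unified Ideographs block; an assumption about what
# "Chinese characters" should cover for this check.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def looks_hacked(text, min_cjk=20):
    """Return a list of reasons why the page text looks suspicious.

    An empty list means none of the heuristics matched.
    """
    reasons = []
    lowered = text.lower()
    for word in SPAM_KEYWORDS:
        # Match whole words only, so "casino" does not match "casinos".
        if re.search(r"\b" + re.escape(word) + r"\b", lowered):
            reasons.append("keyword: " + word)
    cjk_count = len(CJK_RE.findall(text))
    if cjk_count >= min_cjk:
        reasons.append("cjk characters: %d" % cjk_count)
    return reasons
```

A function like this could be called on the text of each external page
once it has been fetched and decoded; where exactly to hook it into the
crawler is left open here.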

A bigger problem with a site like Wikipedia is the sheer number of
pages that would have to be crawled. Wikipedia also publishes full
database downloads at
  https://en.wikipedia.org/wiki/Wikipedia:Database_download
so you don't need to crawl Wikipedia itself to collect its external
links, which will save you a lot of time.
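
To give an idea of how the external links could be pulled out of such a
download, here is a minimal sketch that streams a MediaWiki XML export
and yields the URLs found in the page text. The function names and the
URL pattern are assumptions made for this example (the pattern is
deliberately crude), and the file path passed to open_dump() is
whatever dump file you downloaded.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Crude pattern for external URLs in wikitext; an assumption for this
# sketch, not a complete URL grammar.
URL_RE = re.compile(r'https?://[^\s\[\]|<>"}]+')


def local_name(tag):
    """Strip the XML namespace from a tag such as '{ns}page'."""
    return tag.rsplit("}", 1)[-1]


def open_dump(path):
    """Open a bzip2-compressed dump file for binary reading."""
    return bz2.open(path, "rb")


def iter_external_links(stream):
    """Yield (page title, URL) pairs from a MediaWiki XML export."""
    title = None
    # iterparse processes the file incrementally instead of loading
    # the whole document into memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            for url in URL_RE.findall(elem.text or ""):
                yield title, url
        elif name == "page":
            # Drop the finished page's children to limit memory use.
            elem.clear()
```

The resulting URLs can then be de-duplicated and checked one by one,
so only the external sites need to be fetched, not Wikipedia's pages.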

Hope this helps,

-- 
-- arthur - arthur@arthurdejong.org - https://arthurdejong.org/ --