Re: Webcheck
- From: Arthur de Jong <arthur [at] arthurdejong.org>
- To: Sam Williamson <samwilliamson99 [at] yahoo.co.uk>, webcheck-users [at] lists.arthurdejong.org
- Reply-to: webcheck-users [at] lists.arthurdejong.org
- Subject: Re: Webcheck
- Date: Sun, 15 Mar 2020 15:47:35 +0100
On Wed, 2020-03-11 at 13:10 +0000, Sam Williamson wrote:
> I was interested in using your tool webcheck, but wanted to know if
> it would be able to find links to hacked websites on a website. So
> for example, could you crawl a site like Wikipedia and find all
> external links pointing to a website that is now hacked. I've tried
> to build this myself with Python but struggled. You could look for
> keywords on the sites like "casino", "gambling", "porn" and chinese
> characters to determine if a site has been hacked or not. Anyway, let
> me know what you think and thanks for all your software
Hi Sam,
The webcheck crawler should be reasonably pluggable, so you should be
able to modify it to parse content. I also had an idea to make more of
the parsing pluggable, but I never got around to implementing that.
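As a rough illustration of the kind of content check described in the
question, here is a minimal Python sketch. It is not part of webcheck and
does not use its plugin interface; the function name, the keyword list
(taken from the examples in the question) and the threshold for Chinese
characters are all assumptions made for the example.

```python
import re

# Hypothetical keyword list, taken from the examples in the question.
SPAM_KEYWORDS = ("casino", "gambling", "porn")

# Matches characters in the basic CJK Unified Ideographs block only.
# Treating this block as "Chinese characters" is an assumption.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def looks_hacked(text, min_cjk=20):
    """Return the reasons (possibly none) why the text looks suspicious."""
    lowered = text.lower()
    # Plain substring match, so "casinos" also counts as "casino".
    reasons = [kw for kw in SPAM_KEYWORDS if kw in lowered]
    # Only flag CJK text when there is a fair amount of it.
    if len(CJK_RE.findall(text)) >= min_cjk:
        reasons.append("cjk")
    return reasons
```

A check like this will also flag legitimate gambling sites and any
site that is simply written in Chinese, so it is probably best used to
produce a shortlist for manual review rather than a verdict.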
A bigger problem with a site like Wikipedia is the sheer number of
pages that have to be crawled. Wikipedia also publishes a full database
download
https://en.wikipedia.org/wiki/Wikipedia:Database_download
so you don't need to crawl Wikipedia itself, which will save you a lot
of time.
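For example, once you have the wikitext of a page from the dump, pulling
out the external links could look roughly like the sketch below. The
regular expression is a simplification and will miss or mangle some
URLs, and the function name and the list of Wikimedia host suffixes are
assumptions for the example. Reading and decompressing the dump file
itself is left out.

```python
import re
from urllib.parse import urlsplit

# Simplified URL pattern: stops at whitespace and at characters that
# commonly delimit links in wikitext (brackets, pipes, braces, quotes).
URL_RE = re.compile(r"https?://[^\s\]\[<>|{}\"]+")


def external_links(wikitext, own_suffixes=("wikipedia.org", "wikimedia.org")):
    """Return the sorted, de-duplicated external URLs found in wikitext."""
    links = set()
    for url in URL_RE.findall(wikitext):
        host = urlsplit(url).hostname or ""
        # Skip links that point back at Wikipedia or Wikimedia hosts.
        if any(host == s or host.endswith("." + s) for s in own_suffixes):
            continue
        links.add(url)
    return sorted(links)
```

The resulting list of URLs can then be fetched and inspected one by one,
which is a much smaller job than crawling every Wikipedia page.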
Hope this helps,
--
-- arthur - arthur@arthurdejong.org - https://arthurdejong.org/ --