The paper analyzes typical open source Web crawling software, such as Nutch, Heritrix, WCT, and
Web-Harvest. Based on the results of this analysis, it puts forward a targeted website harvesting system built on Nutch.
Four key issues of this system are discussed in detail: the selection of initial seed websites, the management of the
harvest process, the denoising of web page content, and the discovery of new seed websites.