Search - crawl java
Description: A home-made tool similar to a web crawler. It can crawl an entire website, but does not yet support JavaScript-based links. It fetches pages, rewrites all the URLs they contain, and saves images and files of every format, all while preserving the original path structure.
Platform: |
Size: 783360 |
Author: 三水 |
Hits:
Description: 1. Crawls are restricted to a specific topic;
2. Produces a plain-text log file in the format: timestamp, URL;
3. At most 2 connections may be opened when crawling a given URL (note: the number of local threads used for page parsing is not limited);
4. Follows polite-spider rules: robots.txt and meta tags must be checked for restrictions, and each thread must sleep for 2 seconds after fetching a page;
5. Parses HTML pages, extracts link URLs, determines whether an extracted URL has already been processed, and does not re-parse pages that have already been crawled;
6. Basic parameters of the spider/crawler program are configurable, including crawl depth and seed URLs;
7. Uses User-agent to identify itself to the server;
8. Produces crawl statistics, including crawl speed, time needed to complete the crawl, and total number of pages crawled; important variables and all classes and methods are commented;
9. Follows coding conventions, such as naming conventions for classes, methods, and files;
10. Optional: a GUI or web interface for managing the spider/crawler, including starting and stopping it and adding or removing URLs.
Platform: |
Size: 1911808 |
Author: |
Hits:
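The requirements in the entry above (skip URLs already processed, stop at a configured depth, pause after each page) can be illustrated with a minimal sketch. This is not code from the listed package: the class name `CrawlFrontierSketch`, the `crawl` method, and the `fetchLinks` function (a stand-in for "download the page and extract its links") are all hypothetical, and the demo uses an in-memory link graph instead of real HTTP requests. Checking robots.txt and capping connections per URL are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: breadth-first crawl with de-duplication and a depth limit.
class CrawlFrontierSketch {

    /**
     * @param seed        seed URL (depth 0)
     * @param maxDepth    pages at this depth are visited but their links are not followed
     * @param delayMillis pause after each page; the entry above asks for 2000 ms
     * @param fetchLinks  assumed stand-in for fetching a page and extracting its links
     * @return the URLs in the order they were visited
     */
    static List<String> crawl(String seed, int maxDepth, long delayMillis,
                              Function<String, List<String>> fetchLinks) {
        Map<String, Integer> depthOf = new HashMap<>(); // also serves as the "seen" set
        Deque<String> queue = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        depthOf.put(seed, 0);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);
            int depth = depthOf.get(url);
            if (depth < maxDepth) {
                for (String link : fetchLinks.apply(url)) {
                    // putIfAbsent returns null only the first time a URL is seen
                    if (depthOf.putIfAbsent(link, depth + 1) == null) {
                        queue.add(link);
                    }
                }
            }
            try {
                Thread.sleep(delayMillis); // politeness delay after each page
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph used in place of a real website.
        Map<String, List<String>> site = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "c", List.of("d"),
                "d", List.of("e"));
        Function<String, List<String>> links = u -> site.getOrDefault(u, List.of());
        System.out.println(crawl("a", 2, 0, links)); // prints [a, b, c, d]
    }
}
```

In the demo, page `e` is never visited because `d` already sits at the depth limit of 2, and `a` and `d` are each visited once even though several pages link to them.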
Description: A viewer for real-time stock quotes; demonstrates fetching real-time stock market data from the Sina website.
Platform: |
Size: 13312 |
Author: lijie |
Hits:
Description: A web spider implemented in Java that fetches web pages from the network.
Platform: |
Size: 22528 |
Author: 张涛 |
Hits:
Description: A scraper that fetches and displays the fund data published on fund-related websites.
Platform: |
Size: 21192704 |
Author: wujun |
Hits:
Description: An email-sending program with a GUI, implemented in Java. It can collect email addresses from the web or read them from a file.
Platform: |
Size: 2135040 |
Author: hanchengfeng |
Hits:
Description: Heritrix is an open-source web crawler (web spider). Its goal is to follow the URLs found in pages to extend the crawl, ultimately providing a broad source of data for search engines.
Platform: |
Size: 9784320 |
Author: 傅志诚 |
Hits:
Description: A multi-threaded data collection system for crawling data for vertical search engines. Control is through a graphical interface: a simple configuration of the sites or content to crawl is enough to run it, and the collected data is saved to a database automatically. The database is configurable.
Platform: |
Size: 23880704 |
Author: wdwew |
Hits:
Description: A web crawler. Users can use it to fetch the resources they want from the web, and developers can extend its components to implement their own crawling logic.
Platform: |
Size: 19386368 |
Author: echoli |
Hits:
Description: SSHMail: submits via Ajax, automatically fetches page content, and counts keyword occurrences.
Platform: |
Size: 10762240 |
Author: Grass |
Hits:
Description: A network packet capture program written in Java. It can analyze the captured packets and stores the information from the IP headers in an Access database.
Platform: |
Size: 431104 |
Author: 王娟娟 |
Hits:
Description: A Java program for collecting email.
Platform: |
Size: 1024 |
Author: xxc |
Hits:
Description: Automatic classification of Chinese text. Uses a spider to crawl web content, together with Lucene word segmentation and the KNN method.
Platform: |
Size: 8192 |
Author: TZH |
Hits:
Description: Parses HTML pages and can extract selected parts of a page's content.
Platform: |
Size: 56320 |
Author: 小旭 |
Hits:
Description: A website mirroring tool developed in pure Java. Starting from the entry URLs given in its configuration file, it downloads to local disk every resource of the site that a browser can retrieve via GET, including web pages and files of all types, such as images, Flash, mp3, zip, rar, and exe files. It can download an entire website to the hard disk while keeping the original site structure exactly intact. Placing the downloaded site on a web server (e.g. Apache) is all that is needed for a complete mirror.
Platform: |
Size: 4943872 |
Author: blackieliu |
Hits:
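One way a mirroring tool like the one in the entry above could keep the original site structure is to derive each local file path directly from the resource's URL. The sketch below is an assumption about how that might look, not the listed tool's code: the class `MirrorPathSketch`, the method `localPathFor`, the `index.html` default for directory URLs, and the `mirror` root folder are all hypothetical. It assumes absolute http(s) URLs with a host and ignores query strings.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: map a URL to a local path that mirrors the site's layout.
class MirrorPathSketch {

    /** Returns root/host/path, using index.html for directory-style URLs. */
    static Path localPathFor(Path root, URI uri) {
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html"; // assumed default document name
        }
        Path siteRoot = root.resolve(uri.getHost()).normalize();
        // Drop the leading "/" so the URL path resolves relative to the site root.
        Path target = siteRoot.resolve(path.substring(1)).normalize();
        // Reject URLs whose ".." segments would escape the mirror directory.
        if (!target.startsWith(siteRoot)) {
            throw new IllegalArgumentException("URL escapes mirror root: " + uri);
        }
        return target;
    }

    public static void main(String[] args) {
        Path root = Paths.get("mirror");
        System.out.println(localPathFor(root, URI.create("http://example.com/img/logo.png")));
        System.out.println(localPathFor(root, URI.create("http://example.com/docs/")));
        System.out.println(localPathFor(root, URI.create("http://example.com")));
    }
}
```

Because the host name becomes the first directory level, serving the `mirror/example.com` folder from a web server reproduces the original relative links.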
Description: A web crawler written in Java.
Platform: |
Size: 193536 |
Author: alajfel |
Hits:
Description: A small Java web crawler application; hopefully a useful download for everyone.
Platform: |
Size: 1024 |
Author: 黄少淳 |
Hits:
Description: A web crawler written in Java, with relatively high efficiency. It can selectively crawl the URLs found in a page.
Platform: |
Size: 141312 |
Author: 田宇辰 |
Hits:
Description: A simple Jsoup-based Java program for scraping information from HTML, suitable for beginners and intermediate users.
Platform: |
Size: 15360 |
Author: 杨二狗 |
Hits:
Description: A crawler source file; this Java file can crawl all the link URLs in a web page.
Platform: |
Size: 2048 |
Author: 娃娃娃 |
Hits:
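Several entries above describe extracting all the link URLs from a page. As a rough illustration using only the Java standard library, the sketch below pulls `href` values out of anchor tags with a regular expression and resolves them against the page's own URL. The class `LinkExtractorSketch` and the method `extractLinks` are hypothetical names, and a regular expression is a deliberate simplification: a real HTML parser such as the Jsoup library mentioned in one of the entries handles malformed markup far more reliably. `javascript:` links are skipped, matching the limitation noted in the first entry.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex-based link extraction (a simplification, not a full HTML parser).
class LinkExtractorSketch {

    // Matches href="..." or href='...' inside an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct absolute link URLs in the page, in document order. */
    static List<String> extractLinks(String html, URI base) {
        Set<String> links = new LinkedHashSet<>(); // keeps order, drops duplicates
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String raw = m.group(2).trim();
            if (raw.isEmpty() || raw.startsWith("#")
                    || raw.regionMatches(true, 0, "javascript:", 0, 11)) {
                continue; // skip fragments and script pseudo-links
            }
            try {
                links.add(base.resolve(raw).toString()); // relative -> absolute
            } catch (IllegalArgumentException e) {
                // href is not a valid URI reference; ignore it
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a.html\">A</a> <A HREF='b.html'>B</A> "
                + "<a href=\"#top\">top</a> <a href=\"javascript:void(0)\">js</a> "
                + "<a href=\"/a.html\">again</a>";
        URI base = URI.create("http://example.com/dir/page.html");
        // prints [http://example.com/a.html, http://example.com/dir/b.html]
        System.out.println(extractLinks(html, base));
    }
}
```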