Search - super io

Search list

[Internet-Network] Writing an HTML File Analyzer in Java

Description:

Writing an HTML File Analyzer in Java

 I. Overview

    

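    The core task of a Web server is to analyze the tags in an HTML file correctly, just as the interpreter for a programming language analyzes the reserved words in a source file before interpreting it. In practice, we often need to perform keyword analysis on a particular type of file. For example, to download an HTML file together with the .gif, .class, and other files it references, we must separate out the tags in the HTML file and find the required file names and directories. Before Java, this kind of work meant examining every character of the file to pick out the needed parts, which took a lot of code and was error-prone. In a recent project the author used Java's input-stream class StreamTokenizer to analyze HTML files, with good results. Here we want to download an HTML file from a known Web page, analyze it, and then download the HTML files (if the page uses frames), image files, and class (Java applet) files that the page contains.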

    

    II. StreamTokenizer

    

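    StreamTokenizer, a tokenizing input stream, turns an input stream into a stream of tokens. Tokens fall into three kinds: words (multi-character tokens), single-character tokens, and whitespace (which includes comments in Java and C/C++).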

    

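    The StreamTokenizer class has the constructor StreamTokenizer(InputStream in). That constructor has been deprecated since JDK 1.1 in favor of StreamTokenizer(Reader r), which the HtmlTokenizer class in section III uses.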

    

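    The class has several public instance variables: ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value respectively. To get the characters between tokens (that is, between HTML tags), read the variable sval. To advance to the next token, call nextToken(). nextToken() returns an int. Apart from the character value it returns for a single-character token, there are four possible return values: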

    

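    StreamTokenizer.TT_NUMBER: the token read is a number; its value is a double and can be read from the instance variable nval.

    StreamTokenizer.TT_WORD: the token read is a non-numeric word (other characters may be included in it); the word can be read from the instance variable sval.

    StreamTokenizer.TT_EOL: the token read is an end-of-line marker (reported only after eolIsSignificant(true) has been called).

    StreamTokenizer.TT_EOF: returned by nextToken() when the end of the stream has been reached.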

    

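    Before the first call to nextToken(), set up the syntax table of the input stream so that the parser can tell the different kinds of characters apart. The whitespaceChars(int low, int hi) method defines the range of characters that carry no meaning, and the wordChars(int low, int hi) method defines the range of characters that make up words.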
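
    The calls described above can be tried out in a small standalone program. The following is only a minimal sketch: the class name TokenizerDemo, the method describeTokens, and the particular syntax-table settings are assumptions made for illustration and are not part of the original article.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: labels every token that StreamTokenizer reports.
class TokenizerDemo {

    static List<String> describeTokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();            // start from an empty syntax table
        st.wordChars('a', 'z');      // letters make up words
        st.wordChars('A', 'Z');
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        st.parseNumbers();           // digits, '.' and '-' make up numbers
        st.eolIsSignificant(true);   // report line ends as TT_EOL

        List<String> out = new ArrayList<String>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                case StreamTokenizer.TT_NUMBER:
                    out.add("NUMBER " + st.nval);   // numeric value is a double
                    break;
                case StreamTokenizer.TT_WORD:
                    out.add("WORD " + st.sval);     // word text
                    break;
                case StreamTokenizer.TT_EOL:
                    out.add("EOL");
                    break;
                default:
                    // an ordinary character is returned as its own char value
                    out.add("CHAR " + (char) type);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeTokens("width 42\n<b>")) {
            System.out.println(line);
        }
    }
}
```

    Characters that belong to no class after resetSyntax(), such as '<' and '>', come back from nextToken() as their own character values, which is the behavior an HTML tag scanner can rely on to detect tag boundaries.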

    

    III. Program Implementation

    

    1. Implementing the HtmlTokenizer class

    

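    Before a token stream can be analyzed, its syntax table must be set up; in this example, that means letting the program tell which words are HTML tags. The token stream class below is defined for the HTML tags we need; it is a subclass of StreamTokenizer: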

    

    

    import java.io.*;

    class HtmlTokenizer extends StreamTokenizer {
        // Token codes returned by nextHtml(). Only the tags needed in this
        // example are defined; extend the list as required.
        static final int HTML_TEXT = -1;
        static final int HTML_UNKNOWN = -2;
        static final int HTML_EOF = -3;
        static final int HTML_IMAGE = -4;
        static final int HTML_FRAME = -5;
        static final int HTML_BACKGROUND = -6;
        static final int HTML_APPLET = -7;

        boolean outsideTag = true; // true while the reader is outside a tag

        // Constructor: sets up the syntax table of the token stream.
        public HtmlTokenizer(BufferedReader r) {
            super(r);
            this.resetSyntax();      // reset the syntax table
            this.wordChars(0, 255);  // every character may be part of a word
            this.ordinaryChar('<');  // delimiters on either side of an HTML tag
            this.ordinaryChar('>');
        } // end of constructor

        public int nextHtml() {
            int token; // current token
            try {
                switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // end of the stream reached
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHtml();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHtml();
                case StreamTokenizer.TT_WORD:
                    // the token is a word: decide which tag it is
                    if (allWhite(sval))
                        return nextHtml(); // skip runs of whitespace
                    else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                        return HTML_FRAME; // FRAME tag
                    else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                        return HTML_IMAGE; // IMG tag
                    else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                        return HTML_BACKGROUND; // BACKGROUND attribute
                    else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                        return HTML_APPLET; // APPLET tag
                    else if (outsideTag)
                        // Assumed intent: the original defines HTML_TEXT but
                        // never returns it, so plain text fell through to default.
                        return HTML_TEXT;
                    else
                        return HTML_UNKNOWN; // a tag this example does not handle
                default:
                    System.out.println("Unknown tag: " + token);
                    return HTML_UNKNOWN;
                } // end of switch
            } catch (IOException e) {
                System.out.println("Error:" + e.getMessage());
            }
            return HTML_UNKNOWN;
        } // end of nextHtml

        // Returns true if s contains only whitespace. The original omits the
        // body; this one-line version is an assumption.
        protected boolean allWhite(String s) {
            return s.trim().length() == 0;
        } // end of allWhite

    } // end of class
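
    Once a tag has been recognized, its text is available in sval, and the file name still has to be taken out of it. The following self-contained sketch shows one way this could be done with the same syntax-table settings as above. The class LinkedFileFinder, the helper attributeValue, and the choice of the SRC, BACKGROUND, and CODE attributes are assumptions made for illustration; the original article does not show this step.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: collects file names referenced inside tags.
class LinkedFileFinder {

    // Hypothetical helper: returns the value of attribute `name` in a tag body
    // such as IMG SRC="logo.gif", or null if the attribute is absent.
    static String attributeValue(String tagBody, String name) {
        String key = name + "=";
        for (int i = 0; i + key.length() <= tagBody.length(); i++) {
            if (!tagBody.regionMatches(true, i, key, 0, key.length())) {
                continue; // case-insensitive match failed at this position
            }
            int start = i + key.length();
            int end;
            if (start < tagBody.length() && tagBody.charAt(start) == '"') {
                start++;                              // quoted value
                end = tagBody.indexOf('"', start);
                if (end < 0) end = tagBody.length();
            } else {
                end = start;                          // unquoted value
                while (end < tagBody.length()
                        && !Character.isWhitespace(tagBody.charAt(end))) {
                    end++;
                }
            }
            return tagBody.substring(start, end);
        }
        return null;
    }

    static List<String> findFiles(String html) {
        List<String> files = new ArrayList<String>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();
        st.wordChars(0, 255);    // same syntax table as HtmlTokenizer
        st.ordinaryChar('<');
        st.ordinaryChar('>');
        boolean outsideTag = true;
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '<') {
                    outsideTag = false;
                } else if (token == '>') {
                    outsideTag = true;
                } else if (token == StreamTokenizer.TT_WORD && !outsideTag) {
                    // st.sval holds the whole tag body, e.g. IMG SRC="logo.gif"
                    for (String attr : new String[] {"SRC", "BACKGROUND", "CODE"}) {
                        String value = attributeValue(st.sval, attr);
                        if (value != null) files.add(value);
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        String page = "<BODY BACKGROUND=\"bg.gif\"><IMG SRC=\"logo.gif\">text"
                + "<APPLET CODE=Demo.class WIDTH=10></APPLET></BODY>";
        System.out.println(findFiles(page));
    }
}
```

    The attribute search is deliberately simple: it matches the attribute name anywhere in the tag body, so a real downloader would still need to resolve relative paths and handle attributes whose names merely end in SRC.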

    

    The approach above was tested successfully in a recent project on Windows NT 4, using Inprise JBuilder 3 as the development tool.


Platform: | Size: 1066 | Author: tiberxu | Hits:

[assembly language] io_source

Description: Super IO monitoring program
Platform: | Size: 199827 | Author: zy | Hits:

[Other] gm82c803c

Description: LGS Prime 3C Super / IO datasheet (gm82c803c)
Platform: | Size: 461741 | Author: Valery Kuryshin | Hits:

[assembly language] io_source

Description: Super IO monitoring program
Platform: | Size: 199680 | Author: zy | Hits:

[Linux-Unix] W83977

Description: Driver for Winbond's W83977TF Super IO chip. It took me a long time to get working, and I am now sharing it with everyone.
Platform: | Size: 2048 | Author: 白继波 | Hits:

[Other] gm82c803c

Description: LGS Prime 3C Super/IO datasheet (gm82c803c)
Platform: | Size: 461824 | Author: Valery Kuryshin | Hits:

[SCM] bootload

Description: This package includes: 1. source code for self-programming an AVR microcontroller over the serial port (the example uses the ATMEGA644P and ATMEGA8515; with changes it can be adapted to any AVR microcontroller that supports self-programming), with HyperTerminal on the PC side; 2. source code for an AVR microcontroller reading and writing the 8155 IO expansion IC; 3. source code for an AVR microcontroller reading and writing the Topway (拓普微) LCD T8000 controller; 4. the ATMEGA644P microcontroller header file.
Platform: | Size: 11264 | Author: 苏诚 | Hits:

[Internet-Network] styleman_network

Description: styleman_network network engine v1.0, brief notes: This network engine guarantees 100% packet integrity; the program is robust, with no bugs and no memory leaks, and it is thread-safe. The server places no limit on client connections, and the engine's features are unrestricted. See the examples for usage. CreateNetXXX and DestroyNetXXX must be used in pairs: whatever is created must be destroyed. The server's default heartbeat timeout is 30 seconds, meaning that a connected client must send at least one packet to the server every 30 seconds; the heartbeat timeout is configurable. The server's default connection timeout is 10 seconds, meaning that if a client connects but sends no data, the server closes the connection after 10 seconds to guard against malicious empty-connection attacks; the connection timeout is configurable. A single Send packet may not exceed 15 KB; larger packets fail to send. Do not spend much time in the data callback, and never Sleep inside it, because that blocks the callbacks for other received data and keeps the program from responding promptly. As for performance, a thread pool with select-based IO and WSAAsyncSelect is of course less efficient than IOCP, but it is sufficient for a small network server handling 300-400 client connections. Test in the real environment, observing connection count, IO, and CPU usage, and then set a maximum connection count to reach the best network IO state. Unauthorized use of this program in any commercial software is prohibited and will be pursued legally.
Platform: | Size: 108544 | Author: asdffd | Hits:

[SCM] 4-65-key

Description: A powerful program that reads 65 keys using only four IO pins; useful as a reference when IO pins are scarce.
Platform: | Size: 242688 | Author: jackie | Hits:

[Driver Develop] gpio

Description: Driver source code for the W83627 Super IO chip under Linux. It mainly shows how to access the W83627 registers over the LPC bus.
Platform: | Size: 75776 | Author: zht | Hits:

[Linux-Unix] smsc_fdc37m81x

Description: Interface for smsc fdc48m81x Super IO chip
Platform: | Size: 1024 | Author: gingguntu | Hits:

[SCM] At16-LCDS12864-3bit

Description: 3-bit 128*64 LCD utility program that uses very few IO pins. Developed with AVR Studio 4.
Platform: | Size: 24576 | Author: weiliu | Hits:

[Linux-Unix] riowd

Description: Linux driver for hw watchdog inside Super IO of RIO.
Platform: | Size: 2048 | Author: vouziecun | Hits:

[Linux-Unix] superio

Description: National Semiconductor NS87560UBD Super IO controller used in HP [BCJ]x000 workstations.
Platform: | Size: 6144 | Author: zarongza | Hits:

[Linux-Unix] smsc_fdc37m81x

Description: Interface for smsc fdc48m81x Super IO chip.
Platform: | Size: 2048 | Author: jibengfong | Hits:

[Linux-Unix] smsc_fdc37m81x

Description: Interface for smsc fdc48m81x Super IO chip.
Platform: | Size: 2048 | Author: ltmxlui | Hits:

[Other] 433MHz

Description: PT2262/2272 encoding and decoding emulated with IO pins (433 MHz super-regenerative receiver module).
Platform: | Size: 2048 | Author: lin | Hits:

CodeBus www.codebus.net