Search - StreamTokenizer

Search list

[Internet-Network] Writing an HTML File Analyzer in Java

Description:

Writing an HTML File Analyzer in Java

 I. Overview

    

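    At its core, a Web browser must correctly parse the tags in an HTML file, just as the interpreter for a programming language parses the reserved words in a source file before interpreting it. In practice we often need to scan a particular type of file for keywords. For example, to download an HTML file together with the .gif, .class, and other files it references, we have to separate out the tags in the HTML file and find the required file names and directories. Before Java, this kind of work meant examining every character in the file to pick out the parts needed, which took a lot of code and was error-prone. In a recent project I used Java's input-stream class StreamTokenizer to analyze HTML files, with good results. The goal here is to download an HTML file from a known Web page, analyze it, and then download the HTML files (if the page uses frames), image files, and class (Java applet) files that the page contains.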

    

    II. StreamTokenizer

    

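    StreamTokenizer turns an input stream into a stream of tokens. Tokens fall into three classes: words (multi-character tokens), single-character tokens, and whitespace (which includes comments in the style of Java and C/C++).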

    

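    The constructor of the StreamTokenizer class is: StreamTokenizer(InputStream in). This form has been deprecated since JDK 1.1 in favor of StreamTokenizer(Reader r), which is the form the HtmlTokenizer class in section III uses.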

    

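    The class has several public instance variables: ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value. To get the characters between tokens (that is, between HTML tags), read sval. To advance to the next token, call nextToken(). It returns an int, which is one of four constants or, for a single-character token, the character itself: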

    

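    StreamTokenizer.TT_NUMBER: the token is a number; its value is a double and can be read from the instance variable nval.

    StreamTokenizer.TT_WORD: the token is a non-numeric word (which may include other characters); the word can be read from the instance variable sval.

    StreamTokenizer.TT_EOL: the token is an end-of-line marker (reported only after eolIsSignificant(true) has been called).

    StreamTokenizer.TT_EOF: returned by nextToken() when the end of the stream has been reached.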
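
    As a minimal sketch of the loop described above, the following reads tokens until TT_EOF and reports each one by type. The class name TokenLoopDemo, the method describeTokens, and the sample text are illustrative assumptions, and the default syntax table is used rather than the one built later for HTML.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original article.
class TokenLoopDemo {

    // Tokenizes the text with the default syntax table and returns
    // a readable description of every token.
    static List<String> describeTokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true); // report line ends as TT_EOL
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER " + st.nval); // value is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("EOL");
                        break;
                    default:
                        // single-character token: the return value is the character
                        out.add("CHAR " + (char) type);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeTokens("width = 42\nheight = 7")) {
            System.out.println(line);
        }
    }
}
```

    The default branch is what makes single-character tokens such as '=' visible, which is the same mechanism the HTML tokenizer relies on for '<' and '>'.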

    

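    Before calling nextToken() for the first time, set up the stream's syntax table so that the parser can tell the different kinds of characters apart. The whitespaceChars(int low, int hi) method defines a range of characters that carry no meaning (whitespace). The wordChars(int low, int hi) method defines a range of characters that make up words.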
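
    To illustrate how the syntax table changes what counts as a word, here is a small sketch that resets the table and then declares letters, digits, and '.' as word characters, with commas and blanks as whitespace. The class name SyntaxTableDemo, the method tokens, and the choice of character ranges are assumptions made for this example only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original article.
class SyntaxTableDemo {

    // Splits the text into tokens using a hand-built syntax table.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();             // every character becomes "ordinary"
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');       // digits are word characters here, not numbers
        st.wordChars('.', '.');       // keep file names such as img1.gif in one word
        st.whitespaceChars(0, ' ');   // control characters and space separate tokens
        st.whitespaceChars(',', ','); // treat commas as separators too
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) type)); // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("img1.gif, Logo2.GIF;x"));
    }
}
```

    Because resetSyntax() also switches off number parsing, the digit in img1.gif stays inside the word instead of being reported as TT_NUMBER.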

    

    III. Implementation

    

    1. Implementing the HtmlTokenizer class

    

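    Before analyzing a token stream, first set up its syntax table; in this example, that means letting the program pick out which words are HTML tags. Below is the token-stream class for the HTML tags we need, defined as a subclass of StreamTokenizer: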

    

    

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;

    class HtmlTokenizer extends StreamTokenizer {

        // Tag codes. Only the tags this example needs are defined;
        // extend the list as required.
        static final int HTML_TEXT = -1;
        static final int HTML_UNKNOWN = -2;
        static final int HTML_EOF = -3;
        static final int HTML_IMAGE = -4;
        static final int HTML_FRAME = -5;
        static final int HTML_BACKGROUND = -6;
        static final int HTML_APPLET = -7;

        boolean outsideTag = true; // true while the reader is outside a tag

        // Constructor: defines the syntax table for this token stream.
        public HtmlTokenizer(BufferedReader r) {
            super(r);
            this.resetSyntax();      // reset the syntax table
            this.wordChars(0, 255);  // every character may be part of a word
            this.ordinaryChar('<');  // delimiters on either side of an HTML tag
            this.ordinaryChar('>');
        } // end of constructor

        public int nextHtml() {
            int token; // the current token
            try {
                switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // end of the stream reached
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHtml();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHtml();
                case StreamTokenizer.TT_WORD:
                    // the token is a word: decide which tag it is
                    if (allWhite(sval))
                        return nextHtml(); // skip whitespace-only words
                    String tag = sval.toUpperCase();
                    if (tag.indexOf("FRAME") != -1 && !outsideTag)           // FRAME tag
                        return HTML_FRAME;
                    else if (tag.indexOf("IMG") != -1 && !outsideTag)        // IMG tag
                        return HTML_IMAGE;
                    else if (tag.indexOf("BACKGROUND") != -1 && !outsideTag) // BACKGROUND attribute
                        return HTML_BACKGROUND;
                    else if (tag.indexOf("APPLET") != -1 && !outsideTag)     // APPLET tag
                        return HTML_APPLET;
                    // no match: falls through to default, as in the original
                default:
                    System.out.println("Unknown tag: " + token);
                    return HTML_UNKNOWN;
                } // end of switch
            } catch (IOException e) {
                System.out.println("Error:" + e.getMessage());
            }
            return HTML_UNKNOWN;
        } // end of nextHtml

        // Returns true if s contains nothing but whitespace.
        // The original article omits this body; the line below is a
        // minimal assumed implementation so that the class compiles.
        protected boolean allWhite(String s) {
            return s.trim().length() == 0;
        } // end of allWhite

    } // end of class
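
    The tokenizer above only reports which kind of tag was found; the file name still has to be taken from the tag text left in sval. The article does not show that step, so the following is only a sketch of one way to do it, using a regular expression. The class name AttributeDemo and the helper attributeValue are hypothetical, and the pattern handles only simple name=value pairs with optional single or double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article.
class AttributeDemo {

    // Returns the value of the named attribute inside the tag text,
    // or null if the attribute is not present. Matching ignores case.
    static String attributeValue(String tagText, String name) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(name)
                + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tagText);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three groups matched:
        // double-quoted, single-quoted, or unquoted value.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Tag text as it might appear in sval after HTML_IMAGE is returned.
        System.out.println(attributeValue("IMG SRC=\"images/logo.gif\" WIDTH=10", "SRC"));
        // Tag text as it might appear after HTML_APPLET is returned.
        System.out.println(attributeValue("applet code=Clock.class width=100", "CODE"));
    }
}
```

    A caller would typically loop on nextHtml() until HTML_EOF and pass sval to such a helper with SRC for frames and images, BACKGROUND for the page body, and CODE for applets.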

    

    The approach above was tested successfully in a recent project, on Windows NT 4 with Inprise JBuilder 3 as the development tool.


Platform: | Size: 1066 | Author: tiberxu | Hits:

[JSP/Java] analy3

Description: A text-file analysis program that reads an English text file, counts the occurrences of words, numbers, punctuation marks, and other elements, and records the total number of words. Hint: the StreamTokenizer class can be used to analyze the file.
Platform: | Size: 1942 | Author: 吴泽伟 | Hits:
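
The listed archive is not reproduced here, but the counting task it describes can be sketched as follows. The class name TextStats and the method count are hypothetical. The sketch also assumes that any ordinary character counts as punctuation and that '.', '-', '/', and the quote characters should be made ordinary, because the default syntax table treats them as number parts, a comment start, and string delimiters.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the code from the listed archive.
class TextStats {

    // Returns {words, numbers, punctuation} for the given text.
    static int[] count(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChar('.');  // otherwise a trailing '.' sticks to the word
        st.ordinaryChar('-');  // otherwise '-' is read as a numeric sign
        st.ordinaryChar('/');  // otherwise '/' starts a comment
        st.ordinaryChar('"');  // otherwise quotes start a string token
        st.ordinaryChar('\'');
        int words = 0, numbers = 0, punctuation = 0;
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_WORD) {
                    words++;
                } else if (type == StreamTokenizer.TT_NUMBER) {
                    numbers++;
                } else {
                    punctuation++; // any remaining single character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] { words, numbers, punctuation };
    }

    public static void main(String[] args) {
        int[] c = count("Hello, world! It costs 42 dollars; really?");
        System.out.println("words=" + c[0] + " numbers=" + c[1] + " punctuation=" + c[2]);
    }
}
```

To analyze a file instead of a string, a FileReader wrapped in a BufferedReader can be passed to the StreamTokenizer constructor in place of the StringReader.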

[JSP/Java] StreamTokenizer

Description: An example of the StreamTokenizer class: it tokenizes the input file test.txt and counts the words, numbers, and symbols it contains.
Platform: | Size: 2171 | Author: 余标 | Hits:

[JSP/Java] fio

Description: 1. File and directory management using the java.io.File class, with the following functions: (1) read a file name; (2) check whether that name exists; (3) if it exists, determine whether it is a file or a directory; (4) if it is a file, print its attributes; (5) if it is a directory, print the names of all the files it contains. 2. Binary file I/O: using InputStream, OutputStream, and their subclasses, design and implement a program that copies a file. 3. Text file I/O: write a text-file analysis program that reads an English text file, counts the occurrences of words, numbers, punctuation marks, and other elements, and records the total number of words. Hint: the StreamTokenizer class can be used to analyze the file. 4. Exception handling: add exception handling to the three programs above to improve their error handling.
Platform: | Size: 19800 | Author: 罗春威 | Hits:

[ELanguage] tokenizer

Description: Sun's Java StreamTokenizer class is very useful when writing a lexical analyzer, so I wrote a similar class, CStringTokenizer. It is used much like Java's StreamTokenizer, adds some extra features, and has slightly different function names.
Platform: | Size: 47676 | Author: 侯为 | Hits:

[JSP/Java] analy3

Description: A text-file analysis program that reads an English text file, counts the occurrences of words, numbers, punctuation marks, and other elements, and records the total number of words. Hint: the StreamTokenizer class can be used to analyze the file.
Platform: | Size: 2048 | Author: | Hits:

[JSP/Java] StreamTokenizer

Description: An example of the StreamTokenizer class: it tokenizes the input file test.txt and counts the words, numbers, and symbols it contains.
Platform: | Size: 2048 | Author: 余标 | Hits:

[JSP/Java] fio

Description: 1. File and directory management using the java.io.File class, with the following functions: (1) read a file name; (2) check whether that name exists; (3) if it exists, determine whether it is a file or a directory; (4) if it is a file, print its attributes; (5) if it is a directory, print the names of all the files it contains. 2. Binary file I/O: using InputStream, OutputStream, and their subclasses, design and implement a program that copies a file. 3. Text file I/O: write a text-file analysis program that reads an English text file, counts the occurrences of words, numbers, punctuation marks, and other elements, and records the total number of words. Hint: the StreamTokenizer class can be used to analyze the file. 4. Exception handling: add exception handling to the three programs above to improve their error handling.
Platform: | Size: 19456 | Author: 罗春威 | Hits:

[ELanguage] tokenizer

Description: Sun's Java StreamTokenizer class is very useful when writing a lexical analyzer, so I wrote a similar class, CStringTokenizer. It is used much like Java's StreamTokenizer, adds some extra features, and has slightly different function names.
Platform: | Size: 47104 | Author: 侯为 | Hits:

[Mathimatics-Numerical algorithms] arithmetic_parser

Description: Arithmetic expression parser. The code includes a portable tokenizer like the StreamTokenizer in Java. It parses and interprets an arithmetic expression expressed in a flexible C-like syntax.
Platform: | Size: 4096 | Author: SC | Hits:

[JSP/Java] Format_input

Description: A class that reads standard input through a StreamTokenizer object and determines whether the value the user entered is an integer, a double, or a string.
Platform: | Size: 1024 | Author: Shirley | Hits:

[JSP/Java] Calculette

Description: A class project from the University of Bordeaux 1; the professor, Mr. Baudon, provided this code as an example of a hash tree structure in Java. It implements a calculator class that evaluates a numeric expression. Prefix expressions are used to simplify parsing: there is no precedence and no associativity, and the only stack the calculation needs is that of the recursive calls. For example, + / 4 2 1 evaluates to 3 (= 4/2 + 1). Because StreamTokenizer treats the character - as a numeric sign, the corresponding binary operator is represented by the word "less".
Platform: | Size: 6144 | Author: jin | Hits:
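
The listed archive is not reproduced here. As a rough sketch of the idea in the description, the following evaluates a prefix expression recursively with StreamTokenizer. The class name PrefixCalc, the method names, and the operator set (+, *, / and the word "less" for subtraction) are assumptions; '/' has to be made ordinary because the default syntax table treats it as a comment start.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the code from the listed archive.
class PrefixCalc {

    // Evaluates a complete prefix expression such as "+ / 4 2 1".
    static double evaluate(String expression) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(expression));
        st.ordinaryChar('/'); // default syntax treats '/' as a comment character
        try {
            double result = eval(st);
            if (st.nextToken() != StreamTokenizer.TT_EOF) {
                throw new IllegalArgumentException("unexpected trailing input");
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one operand, or one operator followed by its two operands.
    // The recursion itself plays the role of the evaluation stack.
    private static double eval(StreamTokenizer st) throws IOException {
        int type = st.nextToken();
        if (type == StreamTokenizer.TT_NUMBER) {
            return st.nval;
        }
        boolean isLess = type == StreamTokenizer.TT_WORD && "less".equals(st.sval);
        if (type != '+' && type != '*' && type != '/' && !isLess) {
            throw new IllegalArgumentException("unexpected token in expression");
        }
        double left = eval(st);  // first operand
        double right = eval(st); // second operand
        if (isLess) {
            return left - right;
        }
        switch (type) {
            case '+': return left + right;
            case '*': return left * right;
            default:  return left / right; // type == '/'
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("+/4 2 1")); // 4/2 + 1
    }
}
```

Since - is still parsed as a sign, an input such as less 10 -4 subtracts a negative operand.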

[Linux-Unix] StreamTokenizer

Description: Stream Tokenizer for Linux.
Platform: | Size: 4096 | Author: xengjinghon | Hits:

[Linux-Unix] ANTLRTreePatternLexer

Description: Set when token type is ID or ARG (name mimics Java's StreamTokenizer).
Platform: | Size: 1024 | Author: wingailx | Hits:

CodeBus www.codebus.net