
Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':


Class: TextExtractor (in HTML)


Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-Utilities
Version:
rev: 1.2 date: 2010/04/26 11:28:47
user: sr
file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree
Author:
Claus Gittinger

Description:


A tool to extract the raw text of an HTML document
(either a constructed tree or the output of a parser).
It can be used, for example, to extract strings for searching
or to convert HTML to raw ASCII.

CAVEAT:
    I am not sure whether this implementation is generic enough for
    other uses in its current state
    (we may have to handle special cases such as <PRE>...</PRE> or
    text within form elements to make this really correct).


Class protocol:

extraction
o  extractTextFromDocument: domTree

o  extractTextFromHtmlString: htmlString


Instance protocol:

accessing
o  text

ignored visiting
o  visitElement: anElement
Default method for all html elements.

o  visitString: aString


Examples:



     "build a document with the TreeBuilder, then visit it explicitly"
     |b document x|

     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b 
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     document htmlString inspect.
     x := HTML::TextExtractor new.
     x visit:document.
     "the text collected during the visit"
     x text inspect.


     "parse an HTML string first, then extract the text from the resulting tree"
     |document|

     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.


     "extract the text directly from an HTML string"
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
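

The searching use mentioned in the description might look like the
following sketch. It relies only on extractTextFromHtmlString: from the
class protocol; the sample markup is made up, and it is an assumption
that the extracted text is a String that answers includesString:.

     |text|

     "assumption: the result is a String that understands includesString:"
     text := HTML::TextExtractor extractTextFromHtmlString:'<p>Hello <b>World</b></p>'.
     (text includesString:'World') inspect.

Searching the extracted text rather than the HTML source keeps tag and
attribute names from producing matches.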


ST/X 6.1.1; WebServer 1.620 at exept:8081; Wed, 23 May 2012 09:39:09 GMT