Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':

Class: TextExtractor (in HTML)


Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor
         |
         +--HTML::RichTextExtractor

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.9 date: 2018/06/27 13:29:02
user: cg
file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree
Author:
Claus Gittinger

Description:


A tool to extract the raw text from an HTML document
(either a constructed tree or one produced by a parser).
It can be used, for example, to extract strings for searching
or to build a word list.

CAVEAT:
    This implementation is too simplistic for
    other uses; it ignores all formatting.
    If you want to preserve some of the formatting,
    take a look at HTMLToTextConverter, which is a better converter.


Related information:

    HTML::RichTextExtractor
    HTML::HTMLToTextConverter

Class protocol:

extraction
o  extractTextFromDocument: domTree

o  extractTextFromElement: htmlElement

o  extractTextFromHtmlString: htmlString

instance creation
o  new
return an initialized instance


Instance protocol:

accessing
o  text

initialization
o  initialize
allow for a subclass to have this already initialized

visiting
o  appendString: aString

o  visitElement: anElement
Default method for all html elements.

o  visitHead: anElement
the head is ignored here.

o  visitString: aString
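
A subclass can hook into the visiting protocol listed above. The following is only a sketch under stated assumptions: the class name SpacedTextExtractor and its category are hypothetical, and it assumes that appendString: adds to the result returned by text and that the inherited visitString: does the actual appending. It adds a blank after every visited string, so that text from adjacent elements does not run together.

     "hypothetical subclass; name and category are made up for illustration"
     HTML::TextExtractor subclass:#SpacedTextExtractor
         instanceVariableNames:''
         classVariableNames:''
         poolDictionaries:''
         category:'Net-Documents-HTML-Utilities'.

     "method to be added to SpacedTextExtractor, in protocol 'visiting'"
     visitString: aString
         "let the superclass handle the string, then append a separating blank
          (assumes appendString: adds to what #text returns)"

         super visitString:aString.
         self appendString:' '.

     "usage, following the pattern of the examples for this class"
     |x|

     x := SpacedTextExtractor new.
     x visit:(HTML::HTMLParser parseText:'<p>aaa</p><p>bbb</p>').
     x text inspect.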


Examples:


     |b document x|

     "build a small document tree: a table with one row and two cells"
     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b 
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     "show the generated HTML, then the raw text extracted from the tree"
     document htmlString inspect.
     x := HTML::TextExtractor new.
     x visit:document.
     x text inspect.

     |document|

     "extract from a parsed document, then directly from an HTML string"
     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
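
The description mentions building a word list from the extracted text. A minimal sketch of that use, assuming the extraction result is an ordinary String in which words are separated by whitespace; asCollectionOfWords and asSet are general string and collection messages, not part of this class:

     |text words|

     "extract the raw text from an HTML string"
     text := HTML::TextExtractor extractTextFromHtmlString:'<p>one two <b>two</b> three</p>'.

     "split at whitespace and drop duplicates (assumes text is a String)"
     words := text asCollectionOfWords asSet.
     words inspect.

Whether strings from adjacent elements end up separated by whitespace depends on the extractor, so the resulting word list should be checked against real input.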


ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Wed, 30 Nov 2022 16:46:24 GMT