
Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':


Class: TextExtractor (in HTML)


Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor
         |
         +--HTML::RichTextExtractor

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.10 date: 2021/01/20 15:29:17
user: cg
file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree

Description:


A tool to extract the raw text of an HTML document
(either a constructed tree or the output of a parser).
It can be used, for example, to extract strings for searching
or to create a word list.

CAVEAT:
    This implementation is too simplistic for
    other uses; it ignores all formatting.
    See HTMLToTextConverter for a better converter
    if you want to preserve some of the formatting.

copyright

COPYRIGHT (c) 2007 by eXept Software AG
All Rights Reserved

This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

Class protocol:

extraction
o  extractTextFromDocument: domTree

o  extractTextFromElement: htmlElement

o  extractTextFromHtmlString: htmlString

instance creation
o  new
return an initialized instance


Instance protocol:

accessing
o  text

initialization
o  initialize
allow for a subclass to have this already initialized

visiting
o  appendString: aString

o  visitElement: anElement
Default method for all html elements.

o  visitHead: anElement
the head is ignored here.

o  visitString: aString
(comment from inherited method)
Default method for all html text pieces.
To be defined in subclasses.


Examples:


     |b document x|

     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b 
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     document htmlString inspect.
     x := HTML::TextExtractor new.
     x visit:document.
     x text inspect.


     |document|

     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
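
The description mentions creating a word list from the extracted text. A minimal sketch of that use follows. It assumes that extractTextFromHtmlString: answers a String and that asCollectionOfWords (from the ST/X string protocol) splits it at whitespace; neither is confirmed by this page, and the variable names are illustrative only.

```smalltalk
|text words|

"extract the raw text; the class-side method is listed in the class protocol above"
text := HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>'.

"assumption: text is a String; split it at whitespace and drop duplicate words"
words := text asCollectionOfWords asSet.

"show each distinct word on the Transcript"
words do:[:each | Transcript showCR:each].
```

Because the extractor ignores formatting, words from adjacent elements may run together if the HTML has no whitespace between them, so check the extracted text before relying on the word list.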


ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Thu, 02 Jan 2025 14:54:00 GMT