Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':

Class: TextExtractor (in HTML)

Inheritance
Description
Class protocol
- extraction
- instance creation
Instance protocol
Examples

Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor
         |
         +--HTML::RichTextExtractor

Package: stx:goodies/webServer/htmlTree

Category: Net-Documents-HTML-Utilities

Version: rev: 1.10; date: 2021/01/20 15:29:17; user: cg; file: HTML__TextExtractor.st; directory: goodies/webServer/htmlTree; module: stx; stc-classLibrary: htmlTree

Description:

A tool for extracting the raw text from HTML,
given either a constructed document tree or the output of a parser.
It can be used, for example, to extract strings for searching
or to build a word list.
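
As a rough sketch of the word-list use mentioned above, the extracted text can be split into words and reduced to its distinct entries. This assumes that extractTextFromHtmlString: answers a String and that the string responds to asCollectionOfWords (splitting at whitespace); the variable names are illustrative only.

```smalltalk
|text words|

"assumption: the class-side extraction method answers a String"
text := HTML::TextExtractor extractTextFromHtmlString:'<p>to be or not to be</p>'.

"split at whitespace, keep each distinct word once, and sort the result"
words := text asCollectionOfWords asSet asSortedCollection.
words inspect.
```

Because the extractor ignores formatting, how words from adjacent elements are separated depends on the whitespace present in the HTML source.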

CAVEAT:
    This implementation is too simplistic for
    other uses; it ignores all formatting.
    If you want to preserve some of the formatting,
    take a look at HTMLToTextConverter, which is a better converter.

COPYRIGHT (c) 2007 by eXept Software AG
             All Rights Reserved

This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice.   This software may not
be provided or otherwise made available to, or used by, any
other person.  No title to or ownership of the software is
hereby transferred.

Class protocol:

extraction

extractTextFromDocument: domTree
extractTextFromElement: htmlElement
extractTextFromHtmlString: htmlString

instance creation

new
    Return an initialized instance.

Instance protocol:

accessing

text

initialization

initialize
    Allows a subclass to have this already initialized.

visiting

appendString: aString
visitElement: anElement
    Default method for all HTML elements.
visitHead: anElement
    The head is ignored here.
visitString: aString
    (Comment from inherited method)
    Default method for all HTML text pieces.
    To be defined in subclasses.
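
The visiting protocol can be specialized in a subclass. The following file-in sketch is not taken from the class library: the subclass name UppercaseTextExtractor is hypothetical, and it assumes that every extracted text piece is collected through appendString: and arrives there as a String.

```smalltalk
"hypothetical subclass; its name and category are illustrative only"
HTML::TextExtractor subclass:#UppercaseTextExtractor
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    category:'Net-Documents-HTML-Utilities'
!

!UppercaseTextExtractor methodsFor:'visiting'!

appendString: aString
    "assumption: all text pieces are collected via this method"
    super appendString:aString asUppercase
! !

"usage: parse, visit, then fetch the collected text"
|document x|

document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
x := UppercaseTextExtractor new.
x visit:document.
x text inspect
!
```

Overriding appendString: rather than visitString: keeps the inherited traversal untouched, which is why it was chosen here; if the actual implementation does not route text through appendString:, the override would have to move to visitString: instead.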

Examples:

     "Example 1: build a small document with the tree builder, then visit it"
     |b document x|

     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b 
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     document htmlString inspect.    "the generated HTML source"
     x := HTML::TextExtractor new.
     x visit:document.
     x text inspect.                 "the extracted raw text"

     "Example 2: parse an HTML string, then extract from the resulting document"
     |document|

     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.

     "Example 3: extract directly from an HTML string"
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
