Class: TextExtractor (in HTML)
Object
|
+--HTML::Visitor
   |
   +--HTML::TextExtractor
      |
      +--HTML::RichTextExtractor
- Package: stx:goodies/webServer/htmlTree
- Category: Net-Documents-HTML-Utilities
- Version: rev: 1.10 date: 2021/01/20 15:29:17
- user: cg
- file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
- module: stx stc-classLibrary: htmlTree
A tool to extract the raw text from HTML
(either from a constructed tree or from a parser).
It can be used to extract strings for searching or to
create a word list, for example.
CAVEAT:
This implementation is too simplistic for
other uses; it ignores all formatting.
See HTMLToTextConverter for a better converter
if you want to preserve some of the formatting.
copyright
COPYRIGHT (c) 2007 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
extraction
- extractTextFromDocument: domTree
- extractTextFromElement: htmlElement
- extractTextFromHtmlString: htmlString
instance creation
- new
  return an initialized instance
accessing
- text
initialization
- initialize
  allow for a subclass to have this already initialized
visiting
- appendString: aString
- visitElement: anElement
  Default method for all html elements.
- visitHead: anElement
  the head is ignored here.
- visitString: aString
  (comment from inherited method)
  Default method for all html text pieces.
  To be defined in subclasses.
|b document x|
b := HTML::TreeBuilder new beginWith:(document := Document new).
b
    head;
    headEnd;
    body;
        table;
            tr;
                td; text:'aaa'; tdEnd;
                td; text:'bbb'; tdEnd;
            trEnd;
        tableEnd;
    bodyEnd.
document htmlString inspect.
x := HTML::TextExtractor new.
x visit:document.
x text inspect.

|document|
document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
(HTML::TextExtractor extractTextFromDocument:document) inspect.

(HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
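
The class description mentions building a word list as one possible use. The following is a minimal sketch of that idea, not part of the class itself. The sample HTML and the variable names are made up for illustration, and it assumes that the string returned by extractTextFromHtmlString: understands asCollectionOfWords (a whitespace splitter on strings in Smalltalk/X). How the extractor separates text coming from adjacent elements is not documented here, so the exact word boundaries may differ.

```smalltalk
"Sketch only: collect the distinct, lowercased words of an HTML fragment.
 The sample HTML and the variable names are illustrative assumptions."
|text words|
text := HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1><p>hello again</p>'.
"split on whitespace, normalize case, drop duplicates, sort alphabetically"
words := (text asCollectionOfWords collect:[:each | each asLowercase]) asSet asSortedCollection.
words inspect.
```

Punctuation stays attached to the words in this sketch; a real word list would likely need to strip it first.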