Class: TextExtractor (in HTML)

Object
|
+--HTML::Visitor
   |
   +--HTML::TextExtractor
      |
      +--HTML::RichTextExtractor
- Package: stx:goodies/webServer/htmlTree
- Category: Net-Documents-HTML-Utilities
- Version: rev: 1.9 date: 2018/06/27 13:29:02
  - user: cg
  - file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
  - module: stx stc-classLibrary: htmlTree
- Author: Claus Gittinger
A tool to extract the raw text of some HTML
(either from a constructed tree, or from a parser).
It can be used, for example, to extract strings for searching
or to create a word list.
CAVEAT:
This implementation is too simplistic for
other uses; it ignores all formatting.
If you want to preserve some of the formatting,
take a look at HTMLToTextConverter, which is a better converter.
See also:
  HTML::RichTextExtractor
  HTML::HTMLToTextConverter
extraction
- extractTextFromDocument: domTree
- extractTextFromElement: htmlElement
- extractTextFromHtmlString: htmlString

instance creation
- new
    return an initialized instance

accessing
- text

initialization
- initialize
    allow for a subclass to have this already initialized

visiting
- appendString: aString
- visitElement: anElement
    Default method for all html elements.
- visitHead: anElement
    the head is ignored here.
- visitString: aString

  |b document x|

  "build a document containing a table with one row and two cells"
  b := HTML::TreeBuilder new beginWith:(document := Document new).
  b
    head;
    headEnd;
    body;
      table;
        tr;
          td; text:'aaa'; tdEnd;
          td; text:'bbb'; tdEnd;
        trEnd;
      tableEnd;
    bodyEnd.
  document htmlString inspect.

  "visit the tree and inspect the collected text"
  x := HTML::TextExtractor new.
  x visit:document.
  x text inspect.


  |document|

  "extract the text from a parsed document"
  document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
  (HTML::TextExtractor extractTextFromDocument:document) inspect.


  "extract the text directly from an HTML string"
  (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
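
The description mentions building a word list from the extracted text. A minimal sketch of that use follows. It assumes that extractTextFromHtmlString: answers a String (as the example above suggests) and uses the general-purpose string method asCollectionOfWords to split at whitespace; neither assumption is stated in this class's documentation.

```smalltalk
|text words|

"assumption: the extractor answers the raw text as a String"
text := HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>'.

"split the text at whitespace and drop duplicate words"
words := text asCollectionOfWords asSet.
words inspect.
```

Because the extractor ignores all formatting, words from adjacent elements may or may not be separated by whitespace; check the extracted text before relying on the word boundaries.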