Class: HTMLParser (in HTML)
Object
|
+--HTML::HTMLParser
- Package:
- stx:goodies/webServer/htmlTree
- Category:
- Net-Documents-HTML-Utilities
- Version:
- rev:
1.84
date: 2019/08/08 12:12:30
- user: matilk
- file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree
- module: stx stc-classLibrary: htmlTree
- Author:
- Claus Gittinger
Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.
Notice:
this is a newer and better version of the (old) parser found in libhtml.
Due to the space limitations at the time the old parser was written,
it used a much simpler HTML model (a simple linked list),
which is harder to process later.
Please (try to) use this one for new projects.
IMPORTANT:
textScannedSoFar is in the characterEncoding of the input data.
Conversion takes place when a textBlock is finished!
Element
accessing
-
preserveScripts
-
If true, when parsing <script>, everything up to the closing </script> is read verbatim:
ampersand escapes are not expanded and other markup is ignored.
If false, <script> is parsed without special handling. This is the better
choice if the script includes '</script>' somewhere, which causes the other
approach to fail.
-
preserveScripts: aBoolean
-
If true, when parsing <script>, everything up to the closing </script> is read verbatim:
ampersand escapes are not expanded and other markup is ignored.
If false, <script> is parsed without special handling. This is the better
choice if the script includes '</script>' somewhere, which causes the other
approach to fail.
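A minimal sketch of setting this flag before parsing. It assumes the setter is sent to an instance created with HTML::HTMLParser new, as in the examples at the end of this page; the input string is made up for illustration, and how the script body is represented in the resulting tree is not shown here.

```smalltalk
|p document|

p := HTML::HTMLParser new.
"read the script body verbatim, up to the closing tag"
p preserveScripts:true.
document := p parseText:'<html><body><script>if (a < b) { f(); }</script></body></html>'.
document inspect
```

Passing false instead makes the parser treat the script content like any other markup, which the description above recommends when the script text itself contains '</script>'.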
initialization
-
ampersandEscapes
-
backward compatibility only
** This is an obsolete interface - do not use it (it may vanish in future versions) **
-
elementTypes
-
ElementTypes := nil.
HTMLParser initializeElementTypes
-
initialize
-
self initializeElementTypes. -- now done lazily in #elementTypes
usage example(s):
AmpersandEscapes := nil.
HTMLParser initialize
MathAmpersandEscapes := nil.
HTMLParser initialize
-
initializeElementTypes
-
ElementTypes := nil.
HTMLParser initializeElementTypes
-
mathAmpersandEscapes
-
backward compatibility only
** This is an obsolete interface - do not use it (it may vanish in future versions) **
parsing
-
parseText: aStringOrStream
-
parse aStringOrStream; answer the parsed document
usage example(s):
self parseText:'hello world - this is easy'
self parseText:'hello < world > - this is easy'
self parseText:'hello world this is easy'
self parseText:'hello world this is easy'
self parseText:'hello this is easy'
self parseText:' this is easy'
self
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
self
parseText:('../../doc/online/english/TOP.html'
asFilename readStream)
self
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8
self
parseText:('/Volumes/tmp/ebayParseError.html' asFilename contentsOfEntireFile asString)
characterEncoding:#utf8
-
parseText: aStringOrStream characterEncoding: anEncodingString
-
parse aStringOrStream; answer the parsed document.
The encoding of the character set is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').
usage example(s):
self
parseText:('/tmp/DER-Tour-01.html'
asFilename contentsOfEntireFile asString) characterEncoding:#utf8
-
parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean
-
parse aStringOrStream; answer the parsed document.
The encoding of the character set is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').
usage example(s):
self
parseText:('/tmp/DER-Tour-01.html'
asFilename contentsOfEntireFile asString) characterEncoding:#utf8
accessing
-
canonicalTags: aBoolean
-
if true (the default), parsed tags are all converted to lowercase;
if false, they are kept as found in the HTML source
(only set to false for very special applications)
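A short sketch of the non-default setting, assuming an instance created with HTML::HTMLParser new; the mixed-case input is invented for illustration. With the default (true), the tags would be converted to lowercase instead.

```smalltalk
|p document|

p := HTML::HTMLParser new.
"keep tag names exactly as written in the source"
p canonicalTags:false.
document := p parseText:'<P>Hello <B>world</B></P>'.
document inspect
```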
-
characterEncoding: aString
-
set the character set / encoding for the following text
-
docType
-
-
validate: aBoolean
-
turn off validation by passing false
-
validating
-
error reporting
-
infoMessage: msg
-
emits a warning about some strange but non-fatal scanner/parser error.
Bad naming; this should probably be called warningMessage
private
-
addElement: anElement
-
-
addProcessingInstruction: aProcessingInstruction
-
-
addText: aString
-
self error:'Text after end of html ignored' mayProceed:true.
-
classForType: aTypeSymbol
-
internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')
-
elementFor: aString
-
given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance
-
elementForTag: tagString attributeNamesAndValues: attrNamesAndValues
-
-
endElement: markupTextInclSlash
-
^ self endElement_old:markupTextInclSlash
-
endElementTag: tag
-
self assert:(currentElement mustBeClosed not).
-
endElement_new: markupTextInclSlash
-
-
endElement_old: markupText
-
self assert:(currentElement mustBeClosed not).
-
finishTextBlock
-
finish a scanned textBlock; add it to the markup list
-
inPre
-
return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)
-
parseMarkup
-
'<' has been detected; parse and return a markup element
-
parseMarkup_new
-
'<' has been detected; parse and return a markup element
-
parseMarkup_old
-
'<' has been detected; parse and return a markup element
-
startNewTextBlock
-
public-scanning
-
parseText: aStringOrStream
-
parse some string, return a tree of markups
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename readStream)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
-
parseText: aStringOrStream characterEncoding: anEncodingString
-
-
parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean
-
-
parseText: aStringOrStream withBindings: metaBindings
-
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(this seems to be non-standard HTML, but is used in HotJava).
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
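The examples above do not pass any bindings, so here is a hedged sketch of a call that does. The shape of the dictionary is an assumption: the description only says that ampersand variables are looked up in it, so the key used here (the variable name as a string, without the leading ampersand) and the example URL are hypothetical.

```smalltalk
|bindings document|

bindings := Dictionary new.
"assumption: keys are variable names without the leading '&'"
bindings at:'url' put:'http://www.example.com/'.
document := (HTML::HTMLParser new)
    parseText:'see &url for details'
    withBindings:bindings.
document inspect
```

If the parser expects symbols rather than strings as keys, the at:put: line would need adjusting accordingly.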
-
parseText: aStringOrStream withBindings: metaBindings for: aDestination
-
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(this seems to be non-standard HTML, but is used in HotJava).
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
scanning
-
ampersandEscape
-
parse an ampersand escape; the '&' has already been read.
-
ampersandEscape: aString
-
return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.
usage example(s):
(HTMLParser new) ampersandEscape:'lt'
(HTMLParser new) ampersandEscape:'ouml'
(HTMLParser new) ampersandEscape:'#32'
(HTMLParser new) ampersandEscape:'#x32'
(HTMLParser new) ampersandEscape:'#X32'
(HTMLParser new) ampersandEscape:'apos'
(HTMLParser new) parseText:'hello α β γ normal'
(HTMLParser new) parseText:'hello
-
ampersandEscapeString
-
parse an ampersand escape;
the '&' has already been read.
Return the escape string.
-
attributeName
-
https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name
attribute names must consist of one or more characters
other than the space characters, U+0000 NULL, dbl-quote, single-quote, '>', '/', '=',
the control characters, and any characters that are not defined by Unicode.
-
attributeValueDoubleQuoted
-
skip over initial d-quote
-
attributeValueSingleQuoted
-
skip over initial s-quote
-
attributeValueUnquoted
-
see html5 spec https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attr-unquoted
-
collectParametersFrom: parameterTextArg
-
FIXME: code duplication with HTMLMarkup.
Old code, no longer used (see parseMarkup_new)
-
extractMetaInformationFrom: metaElement
-
<mime-type> ; charset=
-
skipSeparators
-
scripts
-
parseJavaScriptFrom: scriptStream
-
HTML
-
parseSmalltalkScriptFrom: scriptStream
-
-
script: element
-
a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.
-
script_javascript: element
-
a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject
-
script_smalltalkscript: element
-
a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)
ElementTypes := nil.
HTMLParser initializeElementTypes
|p in document|
p := HTML::HTMLParser new.
in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!--
bla bla bla -->
<!--
bla bla bla
-->
</head>
' readStream.
document := p parseText:in.
in close.
document inspect
|
|p in document|
p := HTML::HTMLParser new.
in := '../../doc/online/english/TOP.html' asFilename readStream.
document := p parseText:in.
in close.
document inspect
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>
</tbody></table>
</body>
</html>
' readStream.
document := p parseText:in.
in close.
document inspect
ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Tue, 19 Mar 2024 09:26:11 GMT