Smalltalk/X Webserver

Documentation of class 'HTML::HTMLParser':

Class: HTMLParser (in HTML)


Inheritance:

   Object
   |
   +--HTML::HTMLParser

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.84 date: 2019/08/08 12:12:30
user: matilk
file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree
Author:
Claus Gittinger

Description:


Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.

Notice:
    This is a newer and better version of the (old) parser found in libhtml.
    Due to the space limitations at the time the old parser was written,
    it used a much simpler HTML model (a simple linked list),
    which is harder to process later.
    Please (try to) use this one for new projects.
    
IMPORTANT: 
    textScannedSoFar is in the characterEncoding of the input data. 
    Conversion takes place when a textBlock is finished!


Related information:

    Element

Class protocol:

accessing
o  preserveScripts
If true, when parsing <script>, everything up to the closing </script> is read
without expanding ampersand escapes and without interpreting other markup.
If false, <script> is parsed without special handling. This is the better
choice if the script includes '</script>' somewhere, which causes the other
approach to fail.

o  preserveScripts: aBoolean
If true, when parsing <script>, everything up to the closing </script> is read
without expanding ampersand escapes and without interpreting other markup.
If false, <script> is parsed without special handling. This is the better
choice if the script includes '</script>' somewhere, which causes the other
approach to fail.
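
A minimal sketch of switching the flag around a single parse, using only the class-side selectors documented on this page. The HTML literal is an illustration; the previous setting is saved and restored because the default value is not stated here.

```smalltalk
| previous document |

"remember the current class-wide setting so it can be restored afterwards"
previous := HTML::HTMLParser preserveScripts.

"read everything up to </script> verbatim; the '<' and '&&' inside
 the script are then not treated as markup or ampersand escapes"
HTML::HTMLParser preserveScripts:true.
[
    document := HTML::HTMLParser parseText:
        '<html><body><script>if (a < b && c) { run(); }</script></body></html>'
] ensure:[
    HTML::HTMLParser preserveScripts:previous
].

document inspect
```

Because the flag is kept on the class side, it affects every parse that follows, which is why the sketch restores it in an ensure block.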

initialization
o  ampersandEscapes
backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

o  elementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  initialize
self initializeElementTypes. -- now done lazily in #elementTypes

usage example(s):

     AmpersandEscapes := nil.
     HTMLParser initialize

     MathAmpersandEscapes := nil.
     HTMLParser initialize

o  initializeElementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  mathAmpersandEscapes
backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

parsing
o  parseText: aStringOrStream
parse aStringOrStream; answer the parsed document

usage example(s):

     self parseText:'hello world - this is easy'
     self parseText:'hello < world > - this is easy'
     self parseText:'hello world this is easy'
     self parseText:'hello
world

this is easy'
     self parseText:'hello

  • world
  • foo

this is easy'
     self parseText:'

this is easy'
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8
     self parseText:('/Volumes/tmp/ebayParseError.html' asFilename contentsOfEntireFile asString) characterEncoding:#utf8

o  parseText: aStringOrStream characterEncoding: anEncodingString
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8

o  parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8


Instance protocol:

accessing
o  canonicalTags: aBoolean
if true (the default), parsed tags are all converted to lowercase;
if false, they are kept as found in the HTML source
(only set to false for very special applications)

o  characterEncoding: aString
set the character set / encoding for the following text

o  docType

o  validate: aBoolean
turn off validation by passing false

o  validating
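
A minimal sketch of configuring a parser instance before parsing, using only the setters listed above. The HTML literal is an illustration, and what validation checks in detail is not described on this page.

```smalltalk
| parser document |

parser := HTML::HTMLParser new.

"keep tag names exactly as written in the source (default: lowercase)"
parser canonicalTags:false.

"turn off validation"
parser validate:false.

document := parser parseText:'<HTML><BODY><P>Hello</P></BODY></HTML>'.
document inspect
```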

error reporting
o  infoMessage: msg
emits a warning about a strange, but non-fatal, scanner/parser error.
Bad naming; should probably be called warningMessage

private
o  addElement: anElement

o  addProcessingInstruction: aProcessingInstruction

o  addText: aString
self error:'Text after end of html ignored' mayProceed:true.

o  classForType: aTypeSymbol
internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')

o  elementFor: aString
given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance

o  elementForTag: tagString attributeNamesAndValues: attrNamesAndValues

o  endElement: markupTextInclSlash
^ self endElement_old:markupTextInclSlash

o  endElementTag: tag
self assert:(currentElement mustBeClosed not).

o  endElement_new: markupTextInclSlash

o  endElement_old: markupText
self assert:(currentElement mustBeClosed not).

o  finishTextBlock
finish a scanned textBlock; add it to the markup list

o  inPre
return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)

o  parseMarkup
'<' has been detected; parse and return a markup element

usage example(s):

^ self parseMarkup_old

o  parseMarkup_new
'<' has been detected; parse and return a markup element

o  parseMarkup_old
'<' has been detected; parse and return a markup element

o  startNewTextBlock

public-scanning
o  parseText: aStringOrStream
parse some string, return a tree of markups

usage example(s):

     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

o  parseText: aStringOrStream characterEncoding: anEncodingString

o  parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean
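
The instance-side variants carry no description of their own. A minimal sketch, assuming they take the same arguments as the class-side methods of the same name (where the encoding is given as, e.g., #utf8 or 'iso8859-1'). The input literal is an illustration and is deliberately ASCII-only.

```smalltalk
| parser document |

parser := HTML::HTMLParser new.

"anEncodingString names the character encoding of the input data"
document := parser
                parseText:'<html><head><title>Start</title></head><body><p>hello</p></body></html>'
                characterEncoding:#utf8.

document inspect
```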

o  parseText: aStringOrStream withBindings: metaBindings
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(this seems to be non-standard HTML, but is used in hotjava).
The destination is only required for scripts, which may want to access
the document very early.

usage example(s):

     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

o  parseText: aStringOrStream withBindings: metaBindings for: aDestination
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(this seems to be non-standard HTML, but is used in hotjava).
The destination is only required for scripts, which may want to access
the document very early.

usage example(s):

     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

scanning
o  ampersandEscape
parse an ampersand escape; the '&' has already been read.

o  ampersandEscape: aString
return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.

usage example(s):

     (HTMLParser new) ampersandEscape:'lt'
     (HTMLParser new) ampersandEscape:'ouml'
     (HTMLParser new) ampersandEscape:'#32'
     (HTMLParser new) ampersandEscape:'#x32'
     (HTMLParser new) ampersandEscape:'#X32'
     (HTMLParser new) ampersandEscape:'apos'

     (HTMLParser new) parseText:'hello α β γ normal'
     (HTMLParser new) parseText:'helloworld

this is easy'

o  ampersandEscapeString
parse an ampersand escape;
the '&' has already been read.
Return the escape string.

o  attributeName
https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name
attribute names must consist of one or more characters
other than the space characters, U+0000 NULL, dbl-quote, single-quote, '>', '/', '=',
the control characters, and any characters that are not defined by Unicode.

o  attributeValueDoubleQuoted
skip over initial d-quote

o  attributeValueSingleQuoted
skip over initial s-quote

o  attributeValueUnquoted
see html5 spec https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attr-unquoted

o  collectParametersFrom: parameterTextArg
FIXME: code duplication with HTMLMarkup.
Old code, no longer used (see parseMarkup_new)

o  extractMetaInformationFrom: metaElement
<mime-type> ; charset=

o  skipSeparators

scripts

o  parseJavaScriptFrom: scriptStream
HTML

o  parseSmalltalkScriptFrom: scriptStream

o  script: element
a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.

o  script_javascript: element
a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject

o  script_smalltalkscript: element
a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)


Examples:


ElementTypes := nil. HTMLParser initializeElementTypes

  |p in document|

  p := HTML::HTMLParser new.
  in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!-- 
bla bla bla -->
<!-- 
bla bla bla 
-->
</head>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '../../doc/online/english/TOP.html' asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new.
  in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>

</tbody></table>
</body>
</html>
' readStream.
  document := p parseText:in.
  in close.
  document inspect


ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Wed, 30 Nov 2022 18:21:57 GMT