Documentation of class 'HTML::HTMLParser':

Class: HTMLParser (in HTML)


Inheritance:

   Object
   |
   +--HTML::HTMLParser

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.126 date: 2024/04/22 17:41:10
user: stefan
file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree

Description:


Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.

Notice:
    This is a newer and better version of the (old) parser found in libhtml.
    Due to the space limitations at the time the old parser was written,
    it used a much simpler HTML model (a simple linked list),
    which is harder to process later.
    Please (try to) use this one for new projects.
    
IMPORTANT: 
    textScannedSoFar is in the characterEncoding of the input data. 
    Conversion takes place when a textBlock is finished!

copyright

COPYRIGHT (c) 1996 by Claus Gittinger All Rights Reserved This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

Class protocol:

initialization
o  ampersandEscapes
backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

o  elementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  initializeElementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  mathAmpersandEscapes
backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

instance creation
o  new
(comment from inherited method)
return an instance of myself without indexed variables

parsing
o  parse: aStringOrStream
parse aStringOrStream; answer the parsed document.
For API compatibility with XMLParser

Usage example(s):

     self parse:'hello world - this is easy'  
     self parse:'hello < world > - this is easy'  
     self parse:'hello world this is easy'  
     self parse:'hello
world

this is easy'
     self parse:'hello

  • world
  • foo

this is easy'
     self parse:'

this is easy'

o  parseFile: aFilename
self parseFile:'../../doc/online/english/top.html'

o  parseText: aStringOrStream
parse aStringOrStream; answer the parsed document

Usage example(s):

     self parseText:'hello world - this is easy'  
     self parseText:'hello < world > - this is easy'  
     self parseText:'hello world this is easy'  
     self parseText:'hello
world

this is easy'
     self parseText:'hello

  • world
  • foo

this is easy'
     self parseText:'

this is easy'
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8
     self parseText:('/Volumes/tmp/ebayParseError.html' asFilename contentsOfEntireFile asString) characterEncoding:#utf8

o  parseText: aStringOrStream characterEncoding: anEncodingString
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

Usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8

o  parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

Usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8


Instance protocol:

accessing
o  canonicalTags: aBoolean
if true (the default), parsed tags are all converted to lowercase;
if false, they are kept as found in the HTML source
(only set to false for very special applications)
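
A minimal sketch of keeping tag names as written in the source. It uses only
selectors documented on this page (new, canonicalTags:, parseText:); the input
string is made up for illustration, and the structure of the resulting tree is
not shown here.

```smalltalk
"Sketch only: the HTML string is an invented example."
|p document|

p := HTML::HTMLParser new.
p canonicalTags:false.      "keep 'DIV' and 'B' as found in the source"
document := p parseText:'<DIV>Hello <B>world</B></DIV>'.
document inspect
```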

o  characterEncoding: aString
set the character set / encoding for the following text

o  docType

o  validate: aBoolean
pass false to turn validation off (false is the default)

o  validating

error reporting
o  infoMessage: msg
emits a warning about some strange scanner/parser error (but non-fatal).
Bad naming; should be called warningMessage, probably

initialization
o  initStates
see https://html.spec.whatwg.org/multipage/parsing.html#tokenization 13.2.5.72

o  initialize
unescapeTextContent := true.

private
o  addElement: anElement
currentElement is the currently open element (e.g. <table>);
its parent is on the elementStack.
anElement is an incoming new element start (e.g. <tr>).

If anElement is allowed as a child of the current element,
it is added to the current element.

If anElement (the new one) can have children, the previous current element
is pushed onto the stack and the new element becomes the current element;
otherwise, the current element is left as is.
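
The stack handling described above can be sketched in plain Smalltalk. This is
a hypothetical illustration, not the class's code: elements are modelled as
Symbols, the tables allowed and canHaveChildren are invented for the sketch
(the real parser works on HTML::Element objects), and the case of an element
that is not allowed as a child is simply ignored here.

```smalltalk
"Hypothetical model of addElement: - all names and tables are assumptions."
|allowed canHaveChildren children elementStack currentElement add|

allowed := Dictionary new.            "parent -> permitted child types"
allowed at:#table put:#(#tr).
allowed at:#tr put:#(#td).
allowed at:#td put:#(#br).
canHaveChildren := #(#table #tr #td). "#br is an empty element"

children := Dictionary new.           "parent -> OrderedCollection of children"
elementStack := OrderedCollection new.
currentElement := #table.

add := [:anElement |
    ((allowed at:currentElement ifAbsent:[#()]) includes:anElement) ifTrue:[
        (children at:currentElement ifAbsentPut:[OrderedCollection new])
            add:anElement.
        (canHaveChildren includes:anElement) ifTrue:[
            "remember the previous current element and descend"
            elementStack addLast:currentElement.
            currentElement := anElement.
        ].
    ].
].

add value:#tr.    "tr becomes current; table is pushed"
add value:#td.    "td becomes current; tr is pushed"
add value:#br.    "br is added to td; td stays current"
```

After the three calls, currentElement is #td and elementStack holds #table and
#tr, in that order.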

o  addProcessingInstruction: aProcessingInstruction

o  addText: aString
self error:'Text after end of html ignored' mayProceed:true.

o  beTagClose

o  beTagOpen

o  classForType: aTypeSymbol
internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')

o  elementFor: aString
given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance

o  elementForTag: tagString attributeNamesAndValues: attrNamesAndValues

o  endComment

o  endElement: markupTextInclSlash
^ self endElement_old:markupTextInclSlash

o  endElementTag: tag
wrong nesting; closing tag but not the current element's tag

o  endElement_new: markupTextInclSlash
remove the slash

o  endElement_old: markupText
self assert:(currentElement mustBeClosed not).

o  endTag

o  finishTextBlock
finish a scanned textBlock; add it to the markup list

o  inPre
return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)

o  initializeTagBuffer

o  initializeTagBuffer: forTagOpem

o  initializeTemporaryBuffer

o  parseDOCTYPE: in

o  parseMarkup
'<' has been detected; parse and return a markup element

o  parseScript: inStreamArg
parse the contents of a <script> element
according to: https://www.w3.org/TR/html52/semantics-scripting.html#script-content-restrictions.
Read everything up-to the </script>.
Do not expand ampersand escapes; ignore other markup.

Usage example(s):

        self new parseScript:'bla bla bla ' readStream
        self new parseScript:'bla bla bla ' readStream
        self new parseScript:'' readStream
        self new parseScript:'' readStream
     Illegal:
        self new parseScript:'bla bla bla ' readStream

     self assert:(
        ((HTMLParser new) parseText_old:'') printString
        =
        ((HTMLParser new) parseText_new:'') printString
     )

     self assert:(
        ((HTMLParser new) parseText_old:'!-->') printString    
        =
        ((HTMLParser new) parseText_new:'!-->') printString    
     )

     self assert:(
        ((HTMLParser new) parseText_old:'a & b') printString    
        =
        ((HTMLParser new) parseText_new:'a & b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a && b') printString    
        =
        ((HTMLParser new) parseText_new:'a && b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a &; b') printString    
        =
        ((HTMLParser new) parseText_new:'a &; b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a &x; b') printString    
        =
        ((HTMLParser new) parseText_new:'a &x; b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'I''m &notin; I tell you') printString
        =
        ((HTMLParser new) parseText_new:'I''m &notin; I tell you') printString
     )
     self assert:(
        ((HTMLParser new) parseText_old:'I''m &notit; I tell you') printString
        =
        ((HTMLParser new) parseText_new:'I''m &notit; I tell you') printString
     )
     self assert:(
        ((HTMLParser new) parseText_old:c'') printString    
        =
        ((HTMLParser new) parseText_new:c'') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:c'
\nI''m ¬it;\n\nI tell you-->') printString    
        =
        ((HTMLParser new) parseText_new:c'') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:c'

') printString
        =
        ((HTMLParser new) parseText_new:c'

') printString
     )

   old version is buggy:
     ((HTMLParser new) parseText_old:'') printString
     ((HTMLParser new) parseText_new:'') printString

     ((HTMLParser new) parseText_old:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString))
     =
     ((HTMLParser new) parseText_new:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString))

     ((HTMLParser new) parseText_new:('../../doc/online/english/TOP.html' asFilename readStream)) printString
     =
     ((HTMLParser new) parseText_old:('../../doc/online/english/TOP.html' asFilename readStream)) printString

     DiffTextView
        openOn:((HTMLParser new) parseText_new:('../../doc/online/english/TOP.html' asFilename readStream)) printString
        and:((HTMLParser new) parseText_old:('../../doc/online/english/TOP.html' asFilename readStream)) printString.

     DiffTextView
        openOn:((HTMLParser new) parseText_new:('../../doc/online/english/programming/viewintro.html' asFilename readStream)) printString
        and:((HTMLParser new) parseText_old:('../../doc/online/english/programming/viewintro.html' asFilename readStream)) printString.

o  parseText_old: aStringOrStream
parse some string, return a tree of markups

Usage example(s):

     (HTMLParser new) parseText:'hello world - this is easy'  
     (HTMLParser new) parseText:'hello < world > - this is easy'  
     (HTMLParser new) parseText:'hello world this is easy'  
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText_old:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

scanning
o  ampersandEscape
parse an ampersand escape; the '&' has already been read.

o  ampersandEscape: aString
return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.

Usage example(s):

     (HTMLParser new) ampersandEscape:'lt'
     (HTMLParser new) ampersandEscape:'ouml'
     (HTMLParser new) ampersandEscape:'#32'
     (HTMLParser new) ampersandEscape:'#x32'
     (HTMLParser new) ampersandEscape:'#X32'
     (HTMLParser new) ampersandEscape:'apos'
     (HTMLParser new) ampersandEscape:'blabla'

     (HTMLParser new) parseText:'hello &alpha; &beta; &gamma; normal'
     (HTMLParser new) parseText:'helloworld

this is easy'
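
The numeric forms in the examples above ('#32', '#x32', '#X32') are decimal and
hexadecimal character references. The following is a rough sketch of how such
strings map to characters; decodeNumeric is a hypothetical helper block written
for this illustration, not the method's implementation, and it does no error
checking.

```smalltalk
"Hypothetical helper: decode '#NN' (decimal) and '#xNN' / '#XNN' (hex).
 Like ampersandEscape:, it expects the string without the leading '&'."
|decodeNumeric|

decodeNumeric := [:aString | |digits radix|
    ((aString size > 1) and:[(aString at:2) asLowercase = $x])
        ifTrue:[
            radix := 16.
            digits := aString copyFrom:3 to:aString size ]
        ifFalse:[
            radix := 10.
            digits := aString copyFrom:2 to:aString size ].
    Character value:(Integer readFrom:digits radix:radix)
].

decodeNumeric value:'#32'.     "a space (code point 32)"
decodeNumeric value:'#x32'.    "the digit $2 (code point 16r32 = 50)"
decodeNumeric value:'#X32'.    "same as above; the x is case-insensitive"
```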

o  ampersandEscapeString
parse an ampersand escape;
the '&' has already been read.
Return the escaped string.

o  attributeName
https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name
attribute names must consist of one or more characters
other than the space characters, U+0000 NULL, dbl-quote, single-quote, '>', '/', '=',
the control characters, and any characters that are not defined by Unicode.

o  attributeValueDoubleQuoted
skip over initial d-quote

o  attributeValueSingleQuoted
skip over initial s-quote

o  attributeValueUnquoted
see html5 spec https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attr-unquoted

o  collectParametersFrom: parameterTextArg
FIXME: code duplication with HTMLMarkup.
Old code, no longer used (see parseMarkup_new)

o  extractMetaInformationFrom: metaElement
<mime-type> ; charset=
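
As an illustration of the '<mime-type> ; charset=' form, the sketch below pulls
the charset out of a content string using plain string operations. It is an
assumption about what such an extraction could look like, not the method's
code; the variable names and the sample string are invented for the example.

```smalltalk
"Sketch only: split a meta content value of the form
 '<mime-type> ; charset=<name>'."
|content idx charset|

content := 'text/html; charset=UTF-8'.
idx := content indexOfSubCollection:'charset='.
charset := idx = 0
    ifTrue:[ nil ]                                       "no charset given"
    ifFalse:[ content copyFrom:idx + 8 to:content size ]. "8 = 'charset=' size"
```

With the sample string, charset ends up as 'UTF-8'.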

o  skipSeparators

scripts

o  parseJavaScriptFrom: scriptStream
HTML

o  parseSmalltalkScriptFrom: scriptStream

o  script: element
a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.

o  script_javascript: element
a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject

o  script_smalltalkscript: element
a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)


Examples:


ElementTypes := nil. HTMLParser initializeElementTypes

  |p in document|

  p := HTML::HTMLParser new.
  in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!-- 
bla bla bla -->
<!-- 
bla bla bla 
-->
</head>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '../../doc/online/english/TOP.html' asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new.
  in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>

</tbody></table>
</body>
</html>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '<html><body>combining: a&#768;rest' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p document|

  p := HTML::HTMLParser new.
  document := p parseText:'&auml; <script>
  bla &auml; bla
</script> &auml; <div> &auml; </div> <script>
  more bla &auml; bla
</script> '.
  document inspect

