Smalltalk/X Webserver

Documentation of class 'HTML::HTMLParser':

Class: HTMLParser (in HTML)

Inheritance
Description
Class protocol
Instance protocol
Examples

Inheritance:

   Object
   |
   +--HTML::HTMLParser

Package: stx:goodies/webServer/htmlTree

Category: Net-Documents-HTML-Utilities

Version: rev: 1.126 date: 2024/04/22 17:41:10; user: stefan; file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree; module: stx stc-classLibrary: htmlTree

Description:

Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.
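
As a minimal sketch of typical use, a document can be parsed with the class-side parseText: and the resulting tree inspected. The HTML string is an invented sample, not taken from the class sources:

```smalltalk
|document|

"parse a small, made-up HTML fragment; parseText: answers the parsed document"
document := HTML::HTMLParser parseText:'<html><body><p>hello <b>world</b></p></body></html>'.

"open an inspector on the resulting tree"
document inspect
```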

Notice:
    this is a newer and better version of the (old) parser found in libhtml.
    Due to space limitations at the time it was written,
    the old parser used a much simpler HTML model (a simple linked list),
    which is harder to process later.
    Please (try to) use this one for new projects.
    
IMPORTANT: 
    textScannedSoFar is in the characterEncoding of the input data. 
    Conversion takes place when a textBlock is finished!

COPYRIGHT (c) 1996 by Claus Gittinger
             All Rights Reserved

This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice.   This software may not
be provided or otherwise made available to, or used by, any
other person.  No title to or ownership of the software is
hereby transferred.

Class protocol:

initialization

ampersandEscapes

backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

elementTypes

     ElementTypes := nil.
     HTMLParser initializeElementTypes

initializeElementTypes

     ElementTypes := nil.
     HTMLParser initializeElementTypes

mathAmpersandEscapes

backward compatibility only

** This is an obsolete interface - do not use it (it may vanish in future versions) **

instance creation

new

(comment from inherited method)
return an instance of myself without indexed variables

parsing

parse: aStringOrStream

parse aStringOrStream; answer the parsed document.
For API compatibility with XMLParser

Usage example(s):

     self parse:'hello world - this is easy'
     self parse:'hello < world > - this is easy'
     self parse:'hello world this is easy'
     self parse:'hello world this is easy'
     self parse:'hello world foo this is easy'
     self parse:'

this is easy'

parseFile: aFilename

self parseFile:'../../doc/online/english/top.html'

parseText: aStringOrStream

parse aStringOrStream; answer the parsed document

Usage example(s):

     self parseText:'hello world - this is easy'
     self parseText:'hello < world > - this is easy'
     self parseText:'hello world this is easy'
     self parseText:'hello world this is easy'
     self parseText:'hello world foo this is easy'
     self parseText:'

this is easy'
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8
     self parseText:('/Volumes/tmp/ebayParseError.html' asFilename contentsOfEntireFile asString) characterEncoding:#utf8

parseText: aStringOrStream characterEncoding: anEncodingString

parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

Usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8

parseText: aStringOrStream characterEncoding: anEncodingString unescapeAttributeValues: unescapeAttributeValuesBoolean

parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').

Usage example(s):

     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8

Instance protocol:

accessing

canonicalTags: aBoolean

if true (the default), parsed tags are all converted to lowercase;
if false, they are kept as found in the HTML source
(only set to false for very special applications)

characterEncoding: aString

set the character set / encoding for the following text

docType

validate: aBoolean

turn off validation by passing false (which is the default, btw.)

validating
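
The accessors above are set on a parser instance before parsing. A minimal sketch, assuming the instance-side parseText: used in the Examples section; the input string is an invented sample:

```smalltalk
|parser document|

parser := HTML::HTMLParser new.

"keep tag names as written in the source instead of converting them to lowercase"
parser canonicalTags:false.

"validation is off by default; passing false only states this explicitly"
parser validate:false.

document := parser parseText:'<DIV>Hello <B>World</B></DIV>'.
document inspect
```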

error reporting

infoMessage: msg

emits a warning about some strange (but non-fatal) scanner/parser error.
Bad naming; should probably be called warningMessage

initialization

initStates

see https://html.spec.whatwg.org/multipage/parsing.html#tokenization 13.2.5.72

initialize

unescapeTextContent := true.

private

addElement: anElement

currentElement is the currently open element (e.g. <table>);
its parent is on the elementStack.
anElement is an incoming new element start (e.g. <tr>).

If anElement is allowed as a child of the current element,
it is added to the current element.

If anElement (the new one) can have children, the previous current element
is pushed onto the stack and anElement becomes the current element;
otherwise, the current element is left as is.
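
The bookkeeping described above can be sketched outside the parser. This is an illustration only, not the class's implementation: elements are modelled as Dictionaries, makeElement and the #canHaveChildren flag are invented for the sketch, and the check whether the new element is allowed as a child is left out.

```smalltalk
|elementStack currentElement makeElement addElement table row cell lineBreak|

"hypothetical stand-in for an HTML element: a tag, a children list and a flag"
makeElement := [:tag :canHaveChildren |
    |e|
    e := Dictionary new.
    e at:#tag put:tag.
    e at:#children put:OrderedCollection new.
    e at:#canHaveChildren put:canHaveChildren.
    e
].

elementStack := OrderedCollection new.

addElement := [:anElement |
    "add the new element to the currently open element"
    (currentElement at:#children) add:anElement.
    (anElement at:#canHaveChildren) ifTrue:[
        "remember the previous current element; the new one becomes current"
        elementStack addLast:currentElement.
        currentElement := anElement.
    ].
].

table := makeElement value:'table' value:true.
row := makeElement value:'tr' value:true.
cell := makeElement value:'td' value:true.
lineBreak := makeElement value:'br' value:false.

currentElement := table.
addElement value:row.        "table is pushed; tr becomes current"
addElement value:cell.       "tr is pushed; td becomes current"
addElement value:lineBreak.  "br cannot have children; td stays current"
```

After the three calls, currentElement is the td element and elementStack holds the table and tr elements, in that order.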

addProcessingInstruction: aProcessingInstruction

addText: aString

self error:'Text after end of html ignored' mayProceed:true.

beTagClose

beTagOpen

classForType: aTypeSymbol

internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')

elementFor: aString

given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance

elementForTag: tagString attributeNamesAndValues: attrNamesAndValues

endComment

endElement: markupTextInclSlash

^ self endElement_old:markupTextInclSlash

endElementTag: tag

wrong nesting; closing tag but not the current element's tag

endElement_new: markupTextInclSlash

remove the slash

endElement_old: markupText

self assert:(currentElement mustBeClosed not).

endTag

finishTextBlock

finish a scanned textBlock; add it to the markup list

inPre

return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)

initializeTagBuffer

initializeTagBuffer: forTagOpem

initializeTemporaryBuffer

parseDOCTYPE: in

parseMarkup

'<' has been detected; parse and return a markup element

parseScript: inStreamArg

parse the contents of a <script> element
according to: https://www.w3.org/TR/html52/semantics-scripting.html#script-content-restrictions.
Read everything up-to the </script>.
Do not expand ampersand escapes; ignore other markup.

Usage example(s):

        self new parseScript:'bla bla bla ' readStream
        self new parseScript:'bla bla bla ' readStream
        self new parseScript:'' readStream
        self new parseScript:'' readStream
     Illegal:
        self new parseScript:'bla bla bla ' readStream

parseText_new: aStringOrStream

Usage example(s):

     self assert:(
        ((HTMLParser new) parseText_old:'') printString    
        =
        ((HTMLParser new) parseText_new:'') printString    
     )

     self assert:(
        ((HTMLParser new) parseText_old:'!-->') printString    
        =
        ((HTMLParser new) parseText_new:'!-->') printString    
     )

     self assert:(
        ((HTMLParser new) parseText_old:'a & b') printString    
        =
        ((HTMLParser new) parseText_new:'a & b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a && b') printString    
        =
        ((HTMLParser new) parseText_new:'a && b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a &; b') printString    
        =
        ((HTMLParser new) parseText_new:'a &; b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'a &x; b') printString    
        =
        ((HTMLParser new) parseText_new:'a &x; b') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'I''m &notin; I tell you') printString    
        =
        ((HTMLParser new) parseText_new:'I''m &notin; I tell you') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:'I''m &notit; I tell you') printString    
        =
        ((HTMLParser new) parseText_new:'I''m &notit; I tell you') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:c'') printString    
        =
        ((HTMLParser new) parseText_new:c'') printString    
     )
     self assert:(
        ((HTMLParser new) parseText_old:c'\nI''m ¬it;\n\nI tell you-->') printString    
        =
        ((HTMLParser new) parseText_new:c'') printString    
     )


     old version is buggy:
     ((HTMLParser new) parseText_old:'') printString    
     ((HTMLParser new) parseText_new:'') printString    

     ((HTMLParser new) 
        parseText_old:('../../doc/online/english/TOP.html' 
                        asFilename contentsOfEntireFile asString))
     =
     ((HTMLParser new) 
        parseText_new:('../../doc/online/english/TOP.html' 
                        asFilename contentsOfEntireFile asString))

     ((HTMLParser new) 
        parseText_new:('../../doc/online/english/TOP.html' 
                        asFilename readStream)) printString
     =
     ((HTMLParser new) 
        parseText_old:('../../doc/online/english/TOP.html' 
                        asFilename readStream)) printString

     DiffTextView
        openOn:((HTMLParser new) 
                    parseText_new:('../../doc/online/english/TOP.html' 
                                    asFilename readStream)) printString
        and:((HTMLParser new) 
                    parseText_old:('../../doc/online/english/TOP.html' 
                                    asFilename readStream)) printString.

     DiffTextView
        openOn:((HTMLParser new) 
                    parseText_new:('../../doc/online/english/programming/viewintro.html' 
                                    asFilename readStream)) printString
        and:((HTMLParser new) 
                    parseText_old:('../../doc/online/english/programming/viewintro.html' 
                                    asFilename readStream)) printString.

parseText_old: aStringOrStream

parse some string, return a tree of markups

Usage example(s):

     (HTMLParser new) parseText:'hello world - this is easy'  
     (HTMLParser new) parseText:'hello < world > - this is easy'  
     (HTMLParser new) parseText:'hello world this is easy'  
     (HTMLParser new) parseText:'hello
world this is easy'    
     (HTMLParser new) parseText:'hello
world
foo
 this is easy'    
     (HTMLParser new) parseText_old:'
 this is easy'    

     (HTMLParser new) 
        parseText:('../../doc/online/english/TOP.html' 
                        asFilename contentsOfEntireFile asString)

     (HTMLParser new) 
        parseText:('../../doc/online/english/TOP.html' 
                        asFilename readStream)

     (HTMLParser new) 
        parseText:('../../doc/online/english/programming/viewintro.html' 
                        asFilename contentsOfEntireFile asString)

scanning

ampersandEscape

parse an ampersand escape; the '&' has already been read.

ampersandEscape: aString

return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.

Usage example(s):

     (HTMLParser new) ampersandEscape:'lt'
     (HTMLParser new) ampersandEscape:'ouml'
     (HTMLParser new) ampersandEscape:'#32'
     (HTMLParser new) ampersandEscape:'#x32'
     (HTMLParser new) ampersandEscape:'#X32'
     (HTMLParser new) ampersandEscape:'apos'
     (HTMLParser new) ampersandEscape:'blabla'

     (HTMLParser new) parseText:'hello &alpha; &beta; &gamma; normal'
     (HTMLParser new) parseText:'helloworld this is easy'



ampersandEscapeString

parse an ampersand escape;
the '&' has already been read.
Return the escaped string.

attributeName

https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name
attribute names must consist of one or more characters
other than the space characters, U+0000 NULL, dbl-quote, single-quote, '>', '/', '=',
the control characters, and any characters that are not defined by Unicode.

attributeValueDoubleQuoted

skip over initial d-quote

attributeValueSingleQuoted

skip over initial s-quote

attributeValueUnquoted

see html5 spec https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attr-unquoted

collectParametersFrom: parameterTextArg

FIXME: code duplication with HTMLMarkup.
Old code, no longer used (see parseMarkup_new)

extractMetaInformationFrom: metaElement

<mime-type> ; charset=

skipSeparators



scripts



parseJavaScriptFrom: scriptStream

HTML

parseSmalltalkScriptFrom: scriptStream

script: element

a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.

script_javascript: element

a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject

script_smalltalkscript: element

a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)



Examples:


ElementTypes := nil.
HTMLParser initializeElementTypes

  |p in document|

  p := HTML::HTMLParser new.
  in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!-- 
bla bla bla -->
<!-- 
bla bla bla 
-->
</head>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '../../doc/online/english/TOP.html' asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new.
  in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>

</tbody></table>
</body>
</html>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '<html><body>combining: a&#768;rest' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p document|

  p := HTML::HTMLParser new.
  document := p parseText:'&auml; <script>
  bla &auml; bla
</script> &auml; <div> &auml; </div> <script>
  more bla &auml; bla
</script> '.
  document inspect

ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Tue, 05 Aug 2025 08:20:27 GMT