Class: HTMLUtilities



rev: 1.69 date: 2024/04/22 17:41:45
user: stefan
file: HTMLUtilities.st directory: libbasic2
module: stx stc-classLibrary: libbasic2


Collected support functions to deal with HTML.
Used by HTML generators (DocGenerator), HTML parsers and the web server.
Therefore, it has been put into libbasic2.


Class protocol:

common actions
o  openLauncherOnDisplay: displayName
obsolete - do not use

** This is an obsolete interface - do not use it (it may vanish in future versions) **

o  ampersandEscapes
AmpersandEscapes := nil.
self ampersandEscapes at:#nbsp
self ampersandEscapes at:#ordf

o  htmlEntityToCharacter

o  mathAmpersandEscapes
these are obsolete now, as HTML4 has since added the missing entities.

o  characterFromHtmlEntityNamed: anHtmlEntityName
where to get the mapping???

o  combine: previousChar withDiacriticalMark: markCharacter
in HTML, you may write a base character followed by a combining diacritical mark
(for example, a followed by U+0300, which renders as à).
This combines a mark with some previous character and returns a string.
Incomplete; only the most common ones are defined here; maybe someone will complete it.
(see https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Usage example(s):

     self combine:$a withDiacriticalMark:(Character value:0x300). 
     self combine:$A withDiacriticalMark:(Character value:0x300).
     self combine:$A withDiacriticalMark:(Character value:0x308).
     self combine:$u withDiacriticalMark:(Character value:0x308).
     self combine:$1 withDiacriticalMark:(Character value:0x308).  
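
The combination described above can be illustrated outside Smalltalk. The following is a minimal Python sketch, not the implementation of this class: it assumes that combining means Unicode NFC composition, and the function name combine is hypothetical. Where no precomposed character exists, the base character and the mark are returned unchanged as a two-character string.

```python
import unicodedata


def combine(base: str, mark: str) -> str:
    """Compose a base character with a combining mark (assumption: NFC rules).

    Returns a string: one precomposed character if Unicode defines it,
    otherwise the base character followed by the combining mark.
    """
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    for base, mark in [("a", "\u0300"), ("A", "\u0308"), ("1", "\u0308")]:
        result = combine(base, mark)
        print(repr(result), len(result))
```

Unlike the hand-written table mentioned in the comment above, this sketch relies on the Unicode character database, so it covers every precomposed character that Python's unicodedata module knows about.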

o  controlCharacters
o  copyReplaceCharactersWithHtmlEntitiesIn: aString

o  escapeCharacterEntities: aString
helper to escape invalid/dangerous characters in html strings.
These are:
control characters,
characters above 0x7F
'<', '&' and space -> %XX ascii as hex digits
% -> %%

Usage example(s):

     self escapeCharacterEntities:'a<b'
     self escapeCharacterEntities:'aöb'

o  escapeCharacterEntities: aString andControlCharacters: controlCharacters
helper to escape invalid/dangerous characters in html strings.
These are:
control characters,
characters above 0x7F
'<', '>', '&' and space -> %XX ascii as hex digits
% -> %%

Usage example(s):

     self escapeCharacterEntities:'a

o  escapeCharacterEntities: aString andControlCharacters: controlCharacters on: aWriteStream
helper to escape invalid/dangerous characters in html strings.
These are:
control characters,
characters above 0x7F,
'<', '>', '&' and space -> %XX ascii as hex digits
% -> %%

Usage example(s):

     self escapeCharacterEntities:'a

o  escapeCharacterEntities: aString on: aStream
helper to escape invalid/dangerous characters in html strings.
These are:
control characters, '<', '&' and space -> %XX ascii as hex digits
% -> %%

Usage example(s):

     self escapeCharacterEntities:'a

o  extractCharSetEncodingFromContentType: contentTypeLine
self extractCharSetEncodingFromContentType:'text/html; charset=ascii'
self extractCharSetEncodingFromContentType:'text/html; charset='
self extractCharSetEncodingFromContentType:'text/html; fooBar=bla'
self extractCharSetEncodingFromContentType:'text/xml; charset=utf-8'
self extractCharSetEncodingFromContentType:'text/xml; charset=utf-8; bla=fasel'

o  extractMimeTypeFromContentType: contentTypeLine
self extractMimeTypeFromContentType:'text/html; charset=ascii'
self extractMimeTypeFromContentType:'text/html; '
self extractMimeTypeFromContentType:'text/html'
self extractMimeTypeFromContentType:'text/xml; charset=utf-8'
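
The two extraction helpers above split a Content-Type header line into its MIME type and its charset parameter. The following Python sketch shows one way to do that. It is an illustration only: the function name split_content_type is hypothetical, and returning None for a missing or empty charset is an assumption, since the source does not state what the Smalltalk methods answer in those cases.

```python
def split_content_type(line: str):
    """Split a Content-Type line into (mime_type, charset).

    Assumption: charset is None when the parameter is absent or empty.
    """
    parts = [part.strip() for part in line.split(";")]
    mime_type = parts[0].lower() or None
    charset = None
    for param in parts[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop optional surrounding double quotes around the value.
            value = value.strip().strip('"')
            charset = value or None
            break
    return mime_type, charset


if __name__ == "__main__":
    print(split_content_type("text/xml; charset=utf-8; bla=fasel"))
    print(split_content_type("text/html; fooBar=bla"))
```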

o  htmlEntityForCharacter: aCharacter

o  unEscape: aString
Convert escaped characters in a URL's arguments or post fields back to their proper characters.
Undoes the effect of #urlEncoded: and #urlEncoded2:.
These are:
+ -> space
%XX ascii as hex digits
%uXXXX unicode as hex digits NOTE: %u is non-standard, but implemented in MS IIS
%% -> %

Usage example(s):

     self unEscape:'a%20b'   
     self unEscape:'a%%b'
     self unEscape:'a+b' 
     self unEscape:'a%+b' 
     self unEscape:'a%' 
     self unEscape:'a%2' 
     self unEscape:'/Home/a%C3%A4%C3%B6%C3%BCa'
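
The decoding rules listed above (+, %XX, %uXXXX and %%) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of this class: the name url_decode is hypothetical, consecutive %XX bytes are assumed to be UTF-8 (as the last usage example suggests), and a malformed escape such as a trailing % is simply kept as literal text, which the source does not specify.

```python
import string

_HEX = set(string.hexdigits)


def _is_hex(text: str, length: int) -> bool:
    return len(text) == length and all(ch in _HEX for ch in text)


def url_decode(text: str) -> str:
    out = []                # decoded text pieces
    pending = bytearray()   # raw %XX bytes waiting for UTF-8 decoding

    def flush():
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "+":
            flush(); out.append(" "); i += 1
        elif ch == "%":
            if text.startswith("%%", i):
                flush(); out.append("%"); i += 2
            elif text[i + 1:i + 2] in ("u", "U") and _is_hex(text[i + 2:i + 6], 4):
                # Non-standard %uXXXX form (seen with MS IIS).
                flush(); out.append(chr(int(text[i + 2:i + 6], 16))); i += 6
            elif _is_hex(text[i + 1:i + 3], 2):
                pending.append(int(text[i + 1:i + 3], 16)); i += 3
            else:
                # Assumption: keep a malformed escape as a literal percent sign.
                flush(); out.append("%"); i += 1
        else:
            flush(); out.append(ch); i += 1
    flush()
    return "".join(out)


if __name__ == "__main__":
    print(url_decode("/Home/a%C3%A4%C3%B6%C3%BCa"))
```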

o  unescapeCharacterEntities: aString
helper to unescape character entities in a string.
Normally, this is done by the HTMLParser when it scans text,
but it also seems to be used in post-data fields which contain non-ascii characters
(for example: the login postdata of expeccALM).

Sequences are:

From Reference:

Usage example(s):

     self unescapeCharacterEntities:'&;'            
     self unescapeCharacterEntities:'&16368;'            
     self unescapeCharacterEntities:'&16368;&16368'            
     self unescapeCharacterEntities:'&16368;<'            
     self unescapeCharacterEntities:'&16368;<'            
     self unescapeCharacterEntities:'Ϩ'    
     self unescapeCharacterEntities:'က'    
     self unescapeCharacterEntities:'꿾'    
     self unescapeCharacterEntities:'"<foo'      
     self unescapeCharacterEntities:'&funny;<foo'     
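
As an illustration of unescaping character entities, here is a minimal Python sketch. It is not the implementation of this class. It assumes three entity forms (decimal, hexadecimal and named), uses Python's html.entities table for the names, and leaves anything it does not recognize untouched; the function name is hypothetical.

```python
import re
from html.entities import name2codepoint

# &#NNN;  &#xHHH;  &name;
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_character_entities(text: str) -> str:
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))
            if body[0] == "#":
                return chr(int(body[1:]))
        except (ValueError, OverflowError):
            # Code point out of range: keep the original text.
            return match.group(0)
        codepoint = name2codepoint.get(body)
        # Assumption: unknown names are left as they are.
        return chr(codepoint) if codepoint is not None else match.group(0)

    return _ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(unescape_character_entities("&quot;&lt;foo"))
    print(unescape_character_entities("&funny;&lt;foo"))
```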

o  urlDecoded: aString
Convert escaped characters in a URL's arguments or post fields back to their proper characters.
Undoes the effect of #urlEncoded: and #urlEncoded2:.
These are:
+ -> space
%XX ascii as hex digits
%uXXXX unicode as hex digits NOTE: %u is non-standard, but implemented in MS IIS
%% -> %

Usage example(s):

     self urlDecoded:'a%20b'   
     self urlDecoded:'a%%b'
     self urlDecoded:'a+b' 
     self urlDecoded:'a%+b' 
     self urlDecoded:'a%' 
     self urlDecoded:'a%2' 
     self urlDecoded:'/Home/a%C3%A4%C3%B6%C3%BCa'

o  urlEncode2: aStringOrStream on: ws
helper to escape invalid/dangerous characters in a URL's arguments.
Similar to urlEncode, but treats '*','~' and spaces differently.
(some clients, such as BitTorrent, seem to require this - time will tell...)
Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
see: RFC1738

o  urlEncode: aStringOrStream on: ws
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.

Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~',
is encoded using the '%nn' format, where nn is the hexadecimal value of the byte.
Characters outside the ASCII range are encoded into UTF-8 first.
Spaces are encoded as '+'.
see: application/x-www-form-urlencoded
see: https://tools.ietf.org/html/rfc3986 (obsoletes RFC1738)

o  urlEncoded2: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.
Similar to urlEncoded, but treats '*','~' and spaces differently.
(some clients, such as BitTorrent, seem to require this - time will tell...)
Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
see: application/x-www-form-urlencoded
see: RFC1738

o  urlEncoded: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.

Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
Characters outside the ASCII range are encoded into UTF-8 first.
Spaces are encoded as '+'.
see: application/x-www-form-urlencoded
see: https://tools.ietf.org/html/rfc3986 (obsoletes RFC1738)

Usage example(s):

      self unEscape:(self urlEncoded:'_-.*Frankfurt(Main) Hbf')
      self urlEncoded:'_-.*Frankfurt(Main) Hbf'

      self unEscape:(self urlEncoded:'-_.*%exept;')
      self urlEncoded:'-_.*%exept;'

      self urlEncoded:'Не только в сервере, но и в ComSpec, чтобы дочерние КОНСОЛЬНЫЕ процессы могли пользоваться редиректами'
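
The encoding rule described above (unreserved set kept, space as '+', everything else as %nn after UTF-8 encoding) can be sketched in Python. This is only an illustration of the rule as documented: the name url_encode is hypothetical, and the use of uppercase hex digits is an assumption that the source does not confirm.

```python
import string

# Characters passed through unchanged, as listed in the description above.
UNRESERVED = frozenset(string.ascii_letters + string.digits + ".-_~")


def url_encode(text: str) -> str:
    out = []
    # Encode to UTF-8 first so that non-ASCII characters become several bytes.
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            # Assumption: uppercase hex digits.
            out.append("%%%02X" % byte)
    return "".join(out)


if __name__ == "__main__":
    print(url_encode("_-.*Frankfurt(Main) Hbf"))
    print(url_encode("-_.*%exept;"))
```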

o  withAllSpecialHTMLCharactersEscaped: aStringOrCharacter
replace ampersand, less, greater and quotes by html-character escapes.
This DOES escape quote and doubleQuote characters.

Usage example(s):

     self withAllSpecialHTMLCharactersEscaped:'<>#&'     
     self withAllSpecialHTMLCharactersEscaped:$<
     self withAllSpecialHTMLCharactersEscaped:$#

o  withSpecialHTMLCharactersEscaped: aStringOrCharacter
replace ampersand, less and greater by html-character escapes.
Does NOT escape percent and control characters.
Does NOT escape quote and doubleQuote characters.

Usage example(s):

     self withSpecialHTMLCharactersEscaped:'<>#&'
     self withSpecialHTMLCharactersEscaped:$<
     self withSpecialHTMLCharactersEscaped:$#
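
The difference between the two escaping variants above (with and without quote characters) can be illustrated with Python's standard html.escape. This is a sketch, not this class's code: the function names are hypothetical, and the exact replacement text for quotes (&quot; and &#x27;) is what Python chooses, which may differ from what the Smalltalk methods emit.

```python
import html


def with_special_html_characters_escaped(text: str) -> str:
    """Escape ampersand, less and greater only; quotes stay as they are."""
    return html.escape(text, quote=False)


def with_all_special_html_characters_escaped(text: str) -> str:
    """Escape ampersand, less, greater, and both quote characters."""
    return html.escape(text, quote=True)


if __name__ == "__main__":
    sample = "<a href=\"x\">'&'</a>"
    print(with_special_html_characters_escaped(sample))
    print(with_all_special_html_characters_escaped(sample))
```

The variant that leaves quotes alone is sufficient for element content; text placed inside attribute values needs the quote-escaping variant.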

o  isUtilityClass
(comment from inherited method)
a utility class is one which is not to be instantiated,
but only provides a number of utility functions on the class side.
It is usually also abstract

o  escape: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.
These are:
control characters, dQuote, '+', ';', '?', '&' and space -> %XX ascii as hex digits
% -> %%

Usage example(s):

     self escape:'a b'      
     self escape:'a%b'    
     self escape:'a b'      
     self escape:'a+b'      
     self escape:'aäüöb'      
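
The escaping rule stated above can be sketched in Python as follows. This is an illustration only: the name escape_url_argument is hypothetical, and it assumes uppercase hex digits and that characters not named in the description (including non-ASCII ones such as those in the last example) pass through unchanged, which the source does not confirm.

```python
# Characters named in the description above, in addition to control characters.
_SPECIAL = set('"+;?& ')


def escape_url_argument(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "%":
            out.append("%%")
        elif ch in _SPECIAL or code < 0x20 or code == 0x7F:
            out.append("%%%02X" % code)
        else:
            # Assumption: everything else is copied unchanged.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for sample in ["a b", "a%b", "a+b"]:
        print(escape_url_argument(sample))
```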

text processing helpers
o  convertFromMarkDown: markDownString
given some MarkDown (Wiki), convert to html.

o  convertFromMarkDown: markDownString bodyTag: writeBodyTag
given some MarkDown, convert to html.

Usage example(s):

        mdString := '
# To Do
## At Home
* Wash dishes
* Install winter tires
## At Work
* Finish Report
* Book Team **101** meeting'.

        self convertFromMarkDown:mdString bodyTag:true.

o  convertFromWikiStyle: wikiStyleString
given some wiki text, convert to html.

o  convertFromWikiStyle: wikiStyleString bodyTag: writeBodyTag
given some wiki text, convert to html.

Usage example(s):

       wikiString := '== headline2
=== headline3
=== headline3b ===
* bullet1
* bullet2'.


       self convertFromWikiStyle:wikiString bodyTag:true.

o  plainTextOfHTML: htmlString
given some HTML, extract the raw text.
Can be used to search for strings in some html text.

Usage example(s):

     self plainTextOfHTML:'
            bla1 bla2 

bla6'
     self plainTextOfHTML:'Hello World'
     self plainTextOfHTML:nil
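
Extracting the raw text from HTML, as described above, can be illustrated with Python's standard html.parser module. This is a minimal sketch, not the implementation of this class: the names are hypothetical, and answering an empty string for a missing input is an assumption, since the source does not say what the method returns for nil.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text_of_html(html_string):
    if html_string is None:
        # Assumption: a missing input yields an empty string.
        return ""
    collector = _TextCollector()
    collector.feed(html_string)
    collector.close()  # flush text still buffered at the end of the input
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(plain_text_of_html("<p>bla1 <b>bla2</b></p>"))
    print(plain_text_of_html("Hello World"))
```

The result can then be searched with ordinary string operations, which is the use case mentioned in the comment above.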

