Smalltalk/X Webserver

Documentation of class 'HTMLUtilities':

Class: HTMLUtilities


Inheritance:

   Object
   |
   +--HTMLUtilities

Package:
stx:libbasic2
Category:
Net-Communication-Support
Version:
rev: 1.42 date: 2019/07/26 13:30:30
user: stefan
file: HTMLUtilities.st directory: libbasic2
module: stx stc-classLibrary: libbasic2

Description:


Collected support functions to deal with HTML.
Used by HTML generators (DocGenerator), HTMLParsers and the webServer.
Therefore, it has been put into libbasic2.


Class protocol:

common actions
o  openLauncherOnDisplay: displayName
obsolete - do not use

** This is an obsolete interface - do not use it (it may vanish in future versions) **

constants
o  ampersandEscapes
non-breakable space - do something magic...

o  htmlEntityToCharacter

o  mathAmpersandEscapes
these are obsolete now, as HTML4 added the missing stuff in the meantime.

helpers
o  characterFromHtmlEntityNamed: anHtmlEntityName
where to get the mapping???

o  controlCharacters
EscapeControlCharacters at:$' put:'''.

o  copyReplaceCharactersWithHtmlEntitiesIn: aString

o  escapeCharacterEntities: aString
helper to escape invalid/dangerous characters in html strings.
These are:
control characters, '<', '&' and space -> %XX ascii as hex digits
% -> %%

usage example(s):

     self escapeCharacterEntities:'a

o  escapeCharacterEntities: aString andControlCharacters: controlCharacters
helper to escape invalid/dangerous characters in html strings.
These are:
control characters, '<', '>', '&' and space -> %XX ascii as hex digits
% -> %%

usage example(s):

     self escapeCharacterEntities:'a

o  escapeCharacterEntities: aString andControlCharacters: controlCharacters on: aWriteStream
helper to escape invalid/dangerous characters in html strings.
These are:
control characters, '<', '>', '&' and space -> %XX ascii as hex digits
% -> %%

usage example(s):

     self escapeCharacterEntities:'a

o  escapeCharacterEntities: aString on: aStream
helper to escape invalid/dangerous characters in html strings.
These are:
control characters, '<', '&' and space -> %XX ascii as hex digits
% -> %%

usage example(s):

     self escapeCharacterEntities:'a

o  extractCharSetEncodingFromContentType: contentTypeLine
self extractCharSetEncodingFromContentType:'text/html; charset=ascii'
self extractCharSetEncodingFromContentType:'text/html; charset='
self extractCharSetEncodingFromContentType:'text/html; fooBar=bla'
self extractCharSetEncodingFromContentType:'text/xml; charset=utf-8'
self extractCharSetEncodingFromContentType:'text/xml; charset=utf-8; bla=fasel'
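
The example calls above suggest that this helper picks the charset parameter out of a Content-Type value. A minimal Python sketch of that idea follows. The function name `extract_charset` and the choice to return `None` for a missing or empty charset are assumptions made for illustration; the actual ST/X return values may differ.

```python
from typing import Optional

def extract_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    # Parameters follow the media type and are separated by ';'.
    for param in content_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            value = value.strip().strip('"')
            # Assumption: an empty 'charset=' counts as "no charset given".
            return value or None
    return None

print(extract_charset("text/xml; charset=utf-8; bla=fasel"))
```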

o  extractMimeTypeFromContentType: contentTypeLine
self extractMimeTypeFromContentType:'text/html; charset=ascii'
self extractMimeTypeFromContentType:'text/html; '
self extractMimeTypeFromContentType:'text/html'
self extractMimeTypeFromContentType:'text/xml; charset=utf-8'
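
Judging from the example calls above, the MIME type is the part of the Content-Type value before the first ';'. A short Python sketch under that assumption (the name `extract_mime_type` is hypothetical, not part of ST/X):

```python
def extract_mime_type(content_type: str) -> str:
    """Return the media type part of a Content-Type value."""
    # Everything before the first ';' is the media type; parameters follow it.
    return content_type.split(";", 1)[0].strip()

print(extract_mime_type("text/html; charset=ascii"))
```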

o  htmlEntityForCharacter: aCharacter

o  unEscape: aString
Convert escaped characters in a URL's arguments or post fields back to their proper characters.
Undoes the effect of #urlEncoded: and #urlEncoded2:.
These are:
+ -> space
%XX -> character given as two hex digits
%uXXXX -> Unicode character given as four hex digits (NOTE: %u is non-standard, but implemented in MS IIS)
%% -> %

usage example(s):

     self unEscape:'a%20b'   
     self unEscape:'a%%b'
     self unEscape:'a+b' 
     self unEscape:'a%+b' 
     self unEscape:'a%' 
     self unEscape:'a%2' 
     self unEscape:'/Home/a%C3%A4%C3%B6%C3%BCa'
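
The decoding rules listed above can be sketched in Python as follows. This is an illustration only, not the ST/X implementation: the name `url_unescape` is hypothetical, and two behaviors are assumptions that the description does not confirm, namely that malformed '%' sequences are kept literally and that decoded bytes are interpreted as UTF-8 (as the last example call suggests).

```python
import string

HEX_DIGITS = set(string.hexdigits)

def _is_hex(text: str, length: int) -> bool:
    return len(text) == length and all(c in HEX_DIGITS for c in text)

def url_unescape(text: str) -> str:
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "+":                              # + -> space
            out += b" "
            i += 1
        elif ch == "%" and text[i + 1:i + 2] == "%":   # %% -> %
            out += b"%"
            i += 2
        elif ch == "%" and text[i + 1:i + 2] == "u" and _is_hex(text[i + 2:i + 6], 4):
            # %uXXXX: non-standard Unicode escape
            code_point = int(text[i + 2:i + 6], 16)
            out += chr(code_point).encode("utf-8", errors="replace")
            i += 6
        elif ch == "%" and _is_hex(text[i + 1:i + 3], 2):  # %XX -> one byte
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            # Ordinary character, or a malformed '%' kept as-is (assumption).
            out += ch.encode("utf-8")
            i += 1
    # Assumption: the collected bytes form UTF-8 text.
    return out.decode("utf-8", errors="replace")

print(url_unescape("/Home/a%C3%A4%C3%B6%C3%BCa"))
```

Collecting bytes first and decoding once at the end is what lets multi-byte sequences such as %C3%A4 turn into a single character.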

o  unescapeCharacterEntities: aString
helper to unescape character entities in a string.
Normally, this is done by the HTMLParser when it scans text,
but it also seems to be needed for post-data fields which contain non-ascii characters
(for example: the login postdata of expeccALM).

Sequences are:
&<specialName>;
&#<decimal>;
&#x<hex>

From Reference:
http://wiki.selfhtml.org/wiki/Referenz:HTML/Zeichenreferenz#HTML-eigene_Zeichen

usage example(s):

     self unescapeCharacterEntities:'&;'            
     self unescapeCharacterEntities:'&16368;'            
     self unescapeCharacterEntities:'&16368;&16368'            
     self unescapeCharacterEntities:'&16368;<'            
     self unescapeCharacterEntities:'&16368;<'            
     self unescapeCharacterEntities:'꿾'    
     self unescapeCharacterEntities:'"<foo'      
     self unescapeCharacterEntities:'&funny;<foo'     
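
For comparison, the three sequence forms listed above (named, decimal and hexadecimal references) are also what Python's standard `html.unescape` decodes. The sketch below shows that function, not the ST/X method; the sample strings are chosen for illustration, and how ST/X treats unknown names such as `&funny;` is not stated in this documentation.

```python
import html

# One sample per reference form, plus an unknown entity name.
samples = ["&lt;foo", "&#16368;", "&#xAFFE;", "&funny;&lt;foo"]
decoded = [html.unescape(s) for s in samples]

for raw, plain in zip(samples, decoded):
    print(repr(raw), "->", repr(plain))
```

In Python, an unknown name is left untouched while the known references around it are still decoded.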

o  urlDecoded: aString
Convert escaped characters in a URL's arguments or post fields back to their proper characters.
Undoes the effect of #urlEncoded: and #urlEncoded2:.
These are:
+ -> space
%XX -> character given as two hex digits
%uXXXX -> Unicode character given as four hex digits (NOTE: %u is non-standard, but implemented in MS IIS)
%% -> %

usage example(s):

     self urlDecoded:'a%20b'   
     self urlDecoded:'a%%b'
     self urlDecoded:'a+b' 
     self urlDecoded:'a%+b' 
     self urlDecoded:'a%' 
     self urlDecoded:'a%2' 
     self urlDecoded:'/Home/a%C3%A4%C3%B6%C3%BCa'

o  urlEncode2: aStringOrStream on: ws
helper to escape invalid/dangerous characters in a URL's arguments.
Similar to urlEncode, but treats '*','~' and spaces differently.
(some clients, such as bitTorrent seem to require this - time will tell...)
Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
see: RFC1738

** This is an obsolete interface - do not use it (it may vanish in future versions) **

o  urlEncode: aStringOrStream on: ws
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.

Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~',
is encoded using the '%nn' format, where nn is the hexadecimal value of the byte.
Characters outside the ASCII range are encoded into utf8 first.
Spaces are encoded as '+'.
see: application/x-www-form-urlencoded
see: https://tools.ietf.org/html/rfc3986 (obsoletes RFC1738)

o  urlEncoded2: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.
Similar to urlEncoded, but treats '*','~' and spaces differently.
(some clients, such as bitTorrent seem to require this - time will tell...)
Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
see: application/x-www-form-urlencoded
see: RFC1738

** This is an obsolete interface - do not use it (it may vanish in future versions) **

o  urlEncoded: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.

Any byte not in the set 0-9, a-z, A-Z, '.', '-', '_' and '~', is encoded using
the '%nn' format, where nn is the hexadecimal value of the byte.
Characters outside the ASCII range are encoded into utf8 first.
Spaces are encoded as '+'.
see: application/x-www-form-urlencoded
see: https://tools.ietf.org/html/rfc3986 (obsoletes RFC1738)

usage example(s):

      self unEscape:(self urlEncoded:'_-.*Frankfurt(Main) Hbf')
      self urlEncoded:'_-.*Frankfurt(Main) Hbf'

      self unEscape:(self urlEncoded:'-_.*%exept;')
      self urlEncoded:'-_.*%exept;'
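
The encoding rule described above (UTF-8 first, unreserved bytes kept, space as '+', everything else as '%nn') can be sketched in Python. The name `url_encoded` is hypothetical, and the use of uppercase hex digits is an assumption; the documentation does not say which case ST/X emits.

```python
# Bytes that are passed through unchanged, per the description above.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789.-_~"
)

def url_encoded(text: str) -> str:
    parts = []
    # Characters outside ASCII become several UTF-8 bytes, each escaped.
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch == " ":
            parts.append("+")             # form-urlencoded convention
        elif ch in UNRESERVED:
            parts.append(ch)
        else:
            parts.append(f"%{byte:02X}")  # assumption: uppercase hex
    return "".join(parts)

print(url_encoded("_-.*Frankfurt(Main) Hbf"))
```

In recent Python versions, `urllib.parse.quote_plus(text, safe="")` follows the same convention and would normally be preferred over a hand-written loop.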

o  withAllSpecialHTMLCharactersEscaped: aStringOrCharacter
replace ampersand, less, greater and quotes by html-character escapes

usage example(s):

     self withAllSpecialHTMLCharactersEscaped:'<>#&'     
     self withAllSpecialHTMLCharactersEscaped:$<
     self withAllSpecialHTMLCharactersEscaped:$#

o  withSpecialHTMLCharactersEscaped: aStringOrCharacter
replace ampersand, less and greater by html-character escapes

usage example(s):

     self withSpecialHTMLCharactersEscaped:'<>#&'
     self withSpecialHTMLCharactersEscaped:$<
     self withSpecialHTMLCharactersEscaped:$#
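
The difference between the two escaping helpers above (with and without quotes) has a close analogue in Python's standard `html.escape`, shown here as an illustration. The sample string is made up, and the entities ST/X emits for quote characters may differ from Python's.

```python
import html

text = '<a href="x">R&D</a>'

# Only '&', '<' and '>' are replaced.
basic = html.escape(text, quote=False)

# Quote characters are replaced as well.
full = html.escape(text, quote=True)

print(basic)
print(full)
```

Escaping quotes matters when the result is placed inside an HTML attribute value; for plain element content, the basic variant is enough.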

queries
o  isUtilityClass

serving-helpers
o  escape: aString
helper to escape invalid/dangerous characters in a URL's arguments or post-fields.
These are:
control characters, dQuote, '+', ';', '?', '&' and space -> %XX ascii as hex digits
% -> %%

usage example(s):

     self escape:'a b'      
     self escape:'a%b'    
     self escape:'a b'      
     self escape:'a+b'      
     self escape:'aäüöb'      
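
A rough Python sketch of the rule described above follows. The name `url_escape` is hypothetical. Two points are assumptions not confirmed by this documentation: hex digits are written in uppercase, and characters outside ASCII are passed through unchanged.

```python
# Characters that are written as %XX, in addition to control characters.
ESCAPED = set('"+;?& ')

def url_escape(text: str) -> str:
    parts = []
    for ch in text:
        code = ord(ch)
        if ch == "%":
            parts.append("%%")            # % -> %%
        elif ch in ESCAPED or code < 0x20 or code == 0x7F:
            parts.append(f"%{code:02X}")  # assumption: uppercase hex
        else:
            parts.append(ch)              # assumption: non-ASCII is kept
    return "".join(parts)

print(url_escape("a b"))
print(url_escape("a%b"))
```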

text processing helpers
o  plainTextOfHTML: htmlString
given some HTML, extract the raw text.
Can be used to search for strings in some html text.

usage example(s):

     self plainTextOfHTML:'
            bla1 bla2 
bla3
bla4
bla5

bla6'
     self plainTextOfHTML:'Hello World'
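
As an illustration of the idea (dropping the markup and keeping only the text), here is a minimal Python sketch built on the standard `html.parser` module. The names `plain_text_of_html` and `_TextExtractor` are hypothetical, and the ST/X method may treat whitespace and line breaks differently.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def plain_text_of_html(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

print(plain_text_of_html("<b>Hello</b> <i>World</i>"))
```

Searching the returned string instead of the raw markup avoids false matches inside tag names and attributes.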



ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Tue, 19 Mar 2024 08:44:37 GMT