Class: XMLParser (in XML)



rev: 1.99 date: 2023/10/23 15:31:42
user: stefan
file: XMLParser.st directory: goodies/xml/vw
module: stx stc-classLibrary: vw


This class represents the main XML processor in the system. 
This XMLParser may be used as a validating or non-validating parser to scan and process an XML document 
and provide access to its content and structure to a Smalltalk application. 

This XMLParser tries to follow the guidelines laid out in the W3C XML Version 1.0 Recommendation, 
plus the XML Namespaces Recommendation.

Instance Variables:
    sourceStack     <XML.StreamWrapper>  stack of input streams that handles inclusion.
    hereChar        <Character>  the current character being parsed
    lastSource      <XML.StreamWrapper>  record of previous source used to check correct nesting
    currentSource   <XML.StreamWrapper>  current input stream (the top of sourceStack)
    documentNode    <XML.Document>  the document created by parsing
    dtd     <XML.DocumentType>  the document type definition for the current document
    unresolvedIDREFs        <Collection>  collection of IDREFs that have yet to be resolved; used for validation
    builder <XML.NodeBuilder>  node builder
    validating      <Boolean>  if true, the parser validates the XML
    ignore  <Boolean>  ?
    eol     <Character>  the end-of-line character in the source stream

Class protocol:

attribute processing
o  isValidName: aTag

o  isValidNmToken: aTag

o  defaultNormalizeAttributes: aBoolean

o  concreteClass
return the concrete parser class, per smalltalk dialect

instance creation
o  new
(comment from inherited method)
return an instance of myself without indexed variables

o  on: aStream

o  on: aStream protocol: protocolString name: name

o  parse: aStringOrStream
parse the xml in aStringOrStream;
return a DOM-tree
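
A minimal sketch of this entry point, assuming the class is reachable as XML::XMLParser (the namespace spelling used for XML::DocumentType in this description) and that the package is loaded. The document string is made up for illustration; whether validation is active by default is not stated here.

```smalltalk
| doc |
"parse: accepts a String or a Stream and answers the DOM tree"
doc := XML::XMLParser parse: '<greeting lang="en">Hello</greeting>'.
doc inspect.
```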

o  parseDtdAsPatterns: aStringOrStream
parse a document type from aStringOrStream.
Do not normalize the DTD patterns, so they can be used for type construction.
Answer an XML::DocumentType.

o  parseDtdString: aStringOrStream
parse a DTD from aStringOrStream

o  parseFile: aFilename

o  parseText: aStringOrStream
parse the xml in aStringOrStream;
return a DOM-tree.
For API compatibility with HTMLParser

o  processDocumentInFilename: aFilename

o  processDocumentInFilename: aFilename beforeScanDo: aBlock

o  processDocumentStream: aStream

o  processDocumentStream: aStream beforeScanDo: aBlock

o  processDocumentString: aString

o  processDocumentString: aString beforeScanDo: aBlock

o  isValidNameChar: c
cg: this is not correct:
^ c isLetterOrDigit or: [c == $- or:[c ==$_]]
a name may also contain much more...

o  isValidNameStart: c
cg: this is not correct;
^ c isLetter or: [c ==$_ ]
a name may contain much more...

o  invalid: aString

o  malformed: aString

o  mapEncoding: anEncoding
VisualWorks-specific: map XML encoding names to VW encodedStream names

o  warn: aString
Added to unify warnings for SAX. REW

Instance protocol:

DTD processing
o  conditionalSect

o  dtdEntry

o  dtdFile: newURI
So we don't lose hereChar.

o  dtdStream: aStream rootElement: rootElementNameString
set the DTD from the contents of aStream

o  externalID: usage
Usage may be #docType, #entity, or #notation.
DocType is treated specially, since PE references are not allowed.
Notation is treated specially since the system identifier of the
PUBLIC form is optional.

o  inInternalSubset

o  markUpDecl

o  notationDecl

o  pubIdLiteral
Modified (format): / 11-06-2021 / 22:39:57 / cg

o  systemLiteral

o  uriResolver

o  checkUnresolvedIDREFs

o  rememberIDREF: anID

o  resolveIDREF: anID

o  builder
return the value of the instance variable 'builder' (automatically generated)

o  document
cg: added for twoFlower *compatibility with newer XMLParser framework

o  dtd

o  encoding

o  eol

o  isEncodeChecking

o  isEncodeChecking: aBoolean

o  isTreeBuilding
answer true, if we build a tree of xml elements.
This is false for SAX parsing

o  isTreeBuilding: something

o  normalizeAttributes

o  normalizeAttributes: aBoolean
controls if attribute values like ' foo bar ' are normalized to
'foo bar' or not. The default is true.
If you have to parse non-standard XML, you can set this to false
before parsing
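
A hedged sketch of switching normalization off before parsing. It assumes the beforeScanDo: block receives the parser instance (as the `[:parser |` fragments at the end of this description suggest) and that the message answers the parsed document; the input string is invented for illustration.

```smalltalk
| doc |
doc := XML::XMLParser
    processDocumentString: '<item label=" foo bar "/>'
    beforeScanDo: [:parser |
        "Configure the parser before any input is scanned."
        parser validate: false.
        parser normalizeAttributes: false].
```

With normalization disabled, the attribute value should keep its leading and trailing blanks instead of being reduced to 'foo bar'.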

o  normalizeDtd

o  normalizeDtd: something

o  sourceWrapper
Modified (comment): / 23-02-2022 / 00:41:47 / cg

o  validate: aBoolean

o  comment

o  docTypeDecl

o  latestURI

o  misc
comment or PI

o  parseDtd
parse a plain dtd

o  pi

o  prolog
This is optional.

o  pushSource: aStreamWrapper

o  scanDocument
MessageTally spyOn:[

attribute def processing
o  attListDecl

o  completeNotationType

o  defaultDecl
^(self skipIf: '#REQUIRED')

o  enumeration

attribute processing
o  attValue
cg: must eat all other spaces ...
do it here, to limit changes to one place.
Q: is this true?

o  attribute

o  isValidName: arg

o  isValidNmToken: arg

o  processAttributes

o  quotedString

o  validateAttributes: attributes for: tag

element def processing
o  completeChildren: str

o  completeMixedContent: str
we already have the #PCDATA finished.

o  contentsSpec
^(self skipIf: 'ANY')

o  cp

o  elementDecl

element processing
o  charEntity: data startedIn: str1
parse a character entity and add it to data.
cg: separated into parsing the entity and adding to the stream

o  closeTag: tag return: elements

o  completeCDATA: str1
data := CharacterWriteStream on:(String new: 32).

o  completeComment: str1

o  completePI: str1
pi := self upToAll_positionBefore:'?>'

o  element

o  elementAtPosition: startPosition
self mustFind:'<'.

o  elementContent: tag openedIn: str
(data findString: ']]>' startingAt: 1) = 0
ifFalse: [self halt: 'including ]]> in element content'].

o  generalEntityInText: str canBeExternal: external

o  isValidTag: aTag

o  parseCharEntityStartedIn: str1
parse a character entity.
cg: separated into parsing and separate adding to the stream

entity processing
o  PERef: refType
if we are in IGNORE conditional, this is not an error. gj

o  entityDecl
peDef modified for SAX. REW

o  entityDef: entityName
Parameter entityName added for SAX. REW

o  entityValue

o  generalEntity: str

o  nDataDecl
^self skipSpaceInDTD

o  peDef: entityName
Parameter entityName added for SAX. REW

o  builder: anXMLNodeBuilder

o  lineEndLF

o  on: inputStream

o  on: inputStream protocol: protocolString name: name

o  wrapStream: aStream protocol: protocolString name: name

o  checkForWrongRootNode

o  closeAllFiles

o  documentNode

o  error: aStringOrMessage
(comment from inherited method)
Raise an error with error message aString.
The error is reported by raising the Error exception,
which is non-proceedable.
If no handler has been setup, a debugger is entered.

o  expected: string

o  fullSourceStack

o  getDottedName

o  getElement
cg: added for twoFlower *compatibility with newer XMLParser framework

o  getQualifiedName

o  getSimpleName

o  invalid: aString

o  malformed: aString

o  nmToken

o  notPermitted: string

o  validateEncoding: encName
validate the encoding string in encName.
Set the encoding instVar as a side effect.

o  validateText: data from: start to: stop testBlanks: testBlanks
cg: added for twoFlower *compatibility with newer XMLParser framework

o  warn: aString
Modified to unify the warning system for SAX. REW

o  with: list add: node

o  atEnd

o  forceSpace

o  forceSpaceInDTD

o  getNextChar

o  mustFind: str

o  nextChar
avoid #atEnd if possible (let #next return nil)

o  skipIf: str

o  skipSpace
answer true, if any whitespace was skipped

o  skipSpaceInDTD

o  upTo: aCharacter
Answer a subcollection from the current position to the occurrence (if any, exclusive) of aCharacter.
The stream is left positioned after aCharacter.
If aCharacter is not found, answer everything.

o  upToAll: target
Answer a subcollection from the current position
up to the occurrence (if any, not inclusive) of target,
and leave the stream positioned after the occurrence.
If no occurrence is found, answer the entire remaining
stream contents, and leave the stream positioned at the end.
We are going to cheat here, and assume that the first
character in the target only occurs once in the target, so
that we don't have to backtrack.
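
The non-backtracking scan described above can be sketched as a stand-alone block. This is an illustration only, not the parser's actual code: the block name upToAll, its stream argument, and the sample input are assumptions, and the real method reads from the parser's current source.

```smalltalk
| upToAll in |
"Hypothetical stand-alone version; the parser's own method works on its
 current source rather than on a stream passed as an argument."
upToAll := [:stream :target |
    | out matched |
    out := WriteStream on: String new.
    matched := 0.    "number of target characters matched so far"
    [matched < target size and: [stream atEnd not]] whileTrue: [
        | c |
        c := stream next.
        c = (target at: matched + 1)
            ifTrue: [matched := matched + 1]
            ifFalse: [
                "Emit the partial match instead of backtracking."
                out nextPutAll: (target copyFrom: 1 to: matched).
                c = target first
                    ifTrue: [matched := 1]
                    ifFalse: [matched := 0. out nextPut: c]]].
    "Target never completed: the pending partial match is ordinary content."
    matched < target size
        ifTrue: [out nextPutAll: (target copyFrom: 1 to: matched)].
    out contents].

in := 'target data?>rest' readStream.
upToAll value: in value: '?>'.    "answers the text before the terminator"
in upToEnd.                       "what remains after the terminator"
```

The shortcut is only safe when the first character of the target does not recur inside it. With the target 'aab' and the input 'aaab', this sketch misses the occurrence that starts at the second character and answers the whole input.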

o  upToAll_positionBefore: target
Answer a subcollection from the current position
up to the occurrence (if any, not inclusive) of target,
and leave the stream positioned before the occurrence.
If no occurrence is found, answer the entire remaining
stream contents, and leave the stream positioned at the end.
We are going to cheat here, and assume that the first
character in the target only occurs once in the target, so
that we don't have to backtrack.

o  documentHasDTD

o  hasExpanded: anEntity

o  isIllegalCharacter: anInteger
answer true, if anInteger is an illegal Unicode code point in an XML file
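
As a rough illustration of such a check, the following block tests a code point against the Char production of the XML 1.0 Recommendation. The block name isIllegal is hypothetical, and the method itself may apply different or additional rules.

```smalltalk
| isIllegal |
"Answer true if codePoint lies outside the Char production of XML 1.0:
 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"
isIllegal := [:codePoint |
    ((codePoint = 16r9 or: [codePoint = 16rA or: [codePoint = 16rD]])
        or: [(codePoint between: 16r20 and: 16rD7FF)
        or: [(codePoint between: 16rE000 and: 16rFFFD)
        or: [codePoint between: 16r10000 and: 16r10FFFF]]]) not].

isIllegal value: 16r41.      "a plain letter"
isIllegal value: 16r0.       "NUL"
isIllegal value: 16rFFFE.    "a noncharacter"
```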

o  isValidating

o  shouldTestWFCEntityDeclared


    processDocumentStream:'<HalloWelt />' readStream
    beforeScanDo:[:parser |

    processDocumentStream:'<Hallo_Welt />' readStream
    beforeScanDo:[:parser |

Fails (invalid character):

    processDocumentStream:'<Hallo$Welt />' readStream
    beforeScanDo:[:parser |

