xml parser


phaelax

I originally made this with DarkBasic and decided to port it over. I was able to make a few changes since GLB lets me use arrays in Types. There is a bug or two. It's adding the closing tag of a root node to the array when it shouldn't. And the function that returns a tag's inner content doesn't work properly when it includes the inner content of all its children. I don't have these issues in the DB version, so I'm guessing it's something to do with GLB indices being zero-based while DB starts at 1. Or maybe I just copied something over wrong. There's a test XML file here: http://zimnox.com/quiz.xml

Any time you call xmlReadFile() you should call xmlClear() first.

xmlReadFile(string)
xmlGetElementCount()
xmlGetTagName$(int)
xmlGetAttributeValue$(int, string)
xmlAttributeExists(int, string)
xmlGetAttributeKey$(int, int)
xmlGetAttributeCount(int)
xmlGetTagContent$(int, bool)
xmlClear()
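
A minimal usage sketch of the functions listed above. The file path, the QUESTION tag and the ID attribute are hypothetical names chosen for illustration, not taken from quiz.xml. The parser stores tag names and attribute keys upper-cased, so the comparisons use upper case.

```glbasic
// Hypothetical names: "quiz.xml" path, QUESTION tag, ID attribute
xmlClear()
xmlReadFile("quiz.xml")

LOCAL y = 0
FOR i = 0 TO xmlGetElementCount()-1
    IF xmlGetTagName$(i) = "QUESTION"
        IF xmlAttributeExists(i, "ID")
            // Show the attribute value and the tag's own text (children excluded)
            PRINT "Question " + xmlGetAttributeValue$(i, "ID") + ": " + xmlGetTagContent$(i, FALSE), 10, y
            INC y, 10
        ENDIF
    ENDIF
NEXT

SHOWSCREEN
KEYWAIT
```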

Code (glbasic)

// --------------------------------- //
// Project: XMLParser
// Author: Phaelax
// Start: Wednesday, January 26, 2011
// IDE Version: 8.078



TYPE AttributeSet
key$
value$
ENDTYPE

TYPE ElementObject
tagName$
parentElementId
content$
pos
parentPos
attributes[0] AS AttributeSet
ENDTYPE

GLOBAL escapes$[]
DIM escapes$[5][2]
// "&amp;" is replaced last so that text such as "&amp;apos;" is not unescaped twice
escapes$[0][0] = "&lt;"  ; escapes$[0][1] = "<"
escapes$[1][0] = "&gt;"  ; escapes$[1][1] = ">"
escapes$[2][0] = "&apos;"; escapes$[2][1] = "'"
escapes$[3][0] = "&quot;"; escapes$[3][1] = CHR$(34)
escapes$[4][0] = "&amp;" ; escapes$[4][1] = "&"


GLOBAL xmlTags[] AS ElementObject
GLOBAL parseStack[]
DIM xmlTags[0]


//xmlReadFile("c:/quiz.xml")
xmlReadFile("C:/Documents and Settings/Phaelax.NEWTON64/My Documents/GLBasic/zelda/zelda.gbap")
LOCAL key$
LOCAL y = 0
FOR i = 0 TO xmlGetElementCount()-1
PRINT i+": "+xmlTags[i].tagName$+" -> "+xmlGetTagContent$(i, FALSE), 50, y;INC y, 10
FOR j = 0 TO xmlGetAttributeCount(i)-1
key$ = xmlGetAttributeKey$(i, j)
PRINT key$ + " -> " + xmlGetAttributeValue$(i, key$), 100, y;INC y, 10
NEXT
NEXT


SHOWSCREEN
KEYWAIT
END









FUNCTION xmlReadFile:filename$
LOCAL xmlFileNo = 1
LOCAL L$, tagName$, c$, oldChar$, temp$, unparsedAttributes$
LOCAL matchOpenBracket, tagType, strLength, currentTag
OPENFILE(xmlFileNo, filename$, TRUE)

WHILE ENDOFFILE(xmlFileNo) = FALSE
READLINE xmlFileNo, L$
tagName$ = ""
matchOpenBracket = -1
tagType = 0
strLength = LEN(L$)

FOR i = 0 TO strLength-1
c$ = MID$(L$, i, 1)

//////////////////////////////////////////////
// open bracket found for new tag
//////////////////////////////////////////////
IF c$ = "<"
matchOpenBracket = i
tagType = 0
ENDIF

//////////////////////////////////////////////
// forward slash can either be part of a
// closing container tag, or closing an empty
//////////////////////////////////////////////
IF c$ = "/"
//////////////////////////////////////////////
// If part of a closing tag, the slash will be
// prefixed by the bracket (less-than sign)
//////////////////////////////////////////////
IF oldChar$ = "<" THEN tagType = 1
ENDIF

//////////////////////////////////////////////
// Closing bracket for a tag
//////////////////////////////////////////////
IF c$ = ">"

//////////////////////////////////////////////
// if character before closing bracket was
// a slash, then this bracket closed off an
// empty tag
//////////////////////////////////////////////
IF oldChar$ = "/"
tagType = 2
ELSE
//////////////////////////////////////////////
// "<? ?>" is part of the XML declaration
//////////////////////////////////////////////
IF oldChar$ = "?"
tagType = 2
ELSE
//////////////////////////////////////////////
// Normal close bracket, standard container
//////////////////////////////////////////////
ENDIF
ENDIF
//////////////////////////////////////////////
// If we closed off (completed) the opening
// tag's bracket, then it's open as the current
// container. Add this tag to the container stack
// for tracking the hierarchy and store a new
// tag element in the array
//////////////////////////////////////////////
IF tagType = 0

LOCAL e AS ElementObject
e.pos = matchOpenBracket
temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-1)
e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
e.content$ = ""
//////////////////////////////////////////////
// A parent ID of -1 means it is the root node
//////////////////////////////////////////////
IF LEN(parseStack[]) <= 0
e.parentElementId = -1
ELSE
// The parent is the element index on top of the stack, not the stack depth
e.parentElementId = parseStack[LEN(parseStack[])-1]
//////////////////////////////////////////////
// The position within the parent tag's content
// where this tag's data is present
//////////////////////////////////////////////
e.parentPos = LEN(xmlTags[e.parentElementId].content$)
ENDIF
DIMPUSH xmlTags[], e

//////////////////////////////////////////////
// Add the index of the last tag element added
// to the xmlTags array to the stack. This keeps
// track of what container we're in
//////////////////////////////////////////////
DIMPUSH parseStack[], LEN(xmlTags[])-1
ENDIF
//////////////////////////////////////////////
// Closing tag was found, remove last container
// from stack
//////////////////////////////////////////////
IF tagType = 1
DIMDEL parseStack[], -1
ENDIF

//////////////////////////////////////////////
// This was an empty tag element. As they are
// not containers, nothing is added to the stack
// and nothing needs removed. Create a new
// element and add it to the xmlTags array.
//////////////////////////////////////////////
IF tagType = 2
LOCAL e AS ElementObject

//////////////////////////////////////////////
// Checks for special case with XML declaration
//////////////////////////////////////////////
IF oldChar$ <> "?"
temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-2)
ELSE
temp$ = MID$(L$, matchOpenBracket+2, i-matchOpenBracket-3)
ENDIF

e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
e.content$ = ""
IF LEN(parseStack[]) <= 0
e.parentElementId = -1
ELSE
// The parent is the element index on top of the stack, not the stack depth
e.parentElementId = parseStack[LEN(parseStack[])-1]
//////////////////////////////////////////////
// The position within the parent tag's content
// where this tag's data begins
//////////////////////////////////////////////
e.parentPos = LEN(xmlTags[e.parentElementId].content$)
ENDIF
DIMPUSH xmlTags[], e
ENDIF

//////////////////////////////////////////////
// Start the whole process over again, the
// container has been closed.
//////////////////////////////////////////////
matchOpenBracket = -1

ELSE
IF matchOpenBracket = -1
LOCAL j = LEN(parseStack[])-1
// -1 means no container is open; index 0 is a valid element
currentTag = -1
IF j >= 0 THEN currentTag = parseStack[j]
IF currentTag >= 0 AND currentTag < LEN(xmlTags[])
IF LEN(xmlTags[currentTag].content$) > 0
xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
ELSE
IF ASC(c$) <> 32 AND ASC(c$) <> 9 THEN xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
ENDIF
ENDIF

ENDIF
ENDIF
//////////////////////////////////////////////
// Helps keep track of previous characters when
// checking for forward slashes, which are used
// to determine the type of tag
//////////////////////////////////////////////
oldChar$ = c$
NEXT
WEND

CLOSEFILE xmlFileNo
ENDFUNCTION



FUNCTION xmlClear:
REDIM xmlTags[0]
// Also reset the container stack so a new file starts with no open tags
REDIM parseStack[0]
ENDFUNCTION



FUNCTION xmlGetElementCount:
RETURN LEN(xmlTags[])
ENDFUNCTION



FUNCTION xmlGetTagName$:elementId
RETURN xmlTags[elementId].tagName$
ENDFUNCTION



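FUNCTION xmlGetAttributeValue$:elementId, key$
// Keys are stored upper-cased by pParseXmlAttributes, so normalize the lookup key
key$ = UCASE$(key$)
FOR j = 0 TO xmlGetAttributeCount(elementId)-1
IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN xmlTags[elementId].attributes[j].value$
NEXT
// No attribute with that key
RETURN ""
ENDFUNCTION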



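FUNCTION xmlAttributeExists:elementId, key$
// Keys are stored upper-cased by pParseXmlAttributes, so normalize the lookup key
key$ = UCASE$(key$)
FOR j = 0 TO LEN(xmlTags[elementId].attributes[])-1
IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN TRUE
NEXT
RETURN FALSE
ENDFUNCTION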



FUNCTION xmlGetAttributeKey$:elementId, index
RETURN xmlTags[elementId].attributes[index].key$
ENDFUNCTION



FUNCTION xmlGetAttributeCount:elementId
RETURN LEN(xmlTags[elementId].attributes[])
ENDFUNCTION



FUNCTION xmlGetTagContent$:elementId, includeChildren
LOCAL content$ = xmlTags[elementId].content$
IF includeChildren = TRUE
LOCAL extendedLength = 0
FOR i = 0 TO LEN(xmlTags[])-1
IF xmlTags[i].parentElementId = elementId
content$ = pInsertString$(content$, xmlTags[i].content$, xmlTags[i].parentPos + extendedLength)
extendedLength = extendedLength + LEN(xmlTags[i].content$)
ENDIF
NEXT
ENDIF
RETURN content$
ENDFUNCTION


FUNCTION pParseXmlAttributes:element AS ElementObject, txt$
LOCAL s=0, x=0, s1=0, quote=34
LOCAL key$, value$

FOR j = 0 TO LEN(txt$)-1
x = INSTR(txt$, "=", s)
key$ = UCASE$(TRIM$(MID$(txt$, s, x-s)))
s = INSTR(txt$, CHR$(34), x)+1
s1 = INSTR(txt$, CHR$(39), x)+1

quote = 34
IF s1 > 0
IF s1 < s OR s < 1
s = s1
quote = 39
ENDIF
ENDIF
x = INSTR(txt$, CHR$(quote), s)

value$ = MID$(txt$, s, x-s)
FOR k = 0 TO BOUNDS(escapes$[], 0)-1
value$ = REPLACE$(value$, escapes$[k][0], escapes$[k][1])
NEXT

LOCAL a AS AttributeSet
a.key$ = key$
a.value$ = value$
DIMPUSH element.attributes[], a

s = x+1
j = x
NEXT
ENDFUNCTION



FUNCTION pFindTagNameEndIndex:tagLine$
LOCAL L = LEN(tagLine$)
FOR i = 0 TO L-1
IF MID$(tagLine$, i, 1) = " " THEN RETURN i
NEXT
RETURN L
ENDFUNCTION



FUNCTION pInsertString$:source$, seg$, pos
LOCAL t$ = LEFT$(source$, pos)
source$ = t$ + seg$ + RIGHT$(source$, LEN(source$)-LEN(t$))
RETURN source$
ENDFUNCTION


Moru

Lots of comments, nice! My xml-parser is not this complete so I will use yours instead :-)

Kitty Hello

Can you parse the gbap files (GLBasic project files) with this? That would be... like awesome.

phaelax

Theoretically it should parse the gbap files since they're XML. Just tested it, but it seems I have a bug parsing the attributes for closed tags. I'll work on it some more.

Wampus

Oh! Keep debugging.  :good:

This is rather awesome. To be able to parse xml in GLBasic would open up some interesting possibilities.

phaelax

I got it to parse everything now, as far as I can tell anyway. I'll post the new code here in a minute which will extract all tags/attributes from the project file.  I just want to make a correction to the xml declaration tag name, which shows "?xml" instead of just "xml".  Also, right now attributes only work with double quotes, not single quotes. I want to fix that too.


Does anyone know if mixing single and double quotes around an attribute value is permitted or do they have to match?
Ex. 
something = "puppy'
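
For reference, the XML 1.0 grammar requires an attribute value to be closed by the same quote character that opened it, so the mixed example above is not well-formed. A minimal sketch of reading one value under that rule follows. pReadQuotedValue$ is a hypothetical helper, not part of the parser in the first post. It compares the INSTR result against the opening position, so it behaves the same whether INSTR reports "not found" as -1 or as 0.

```glbasic
// Hypothetical helper: returns the attribute value that starts at the
// first quote character found at or after startPos in txt$.
// Returns "" when no opening quote or no matching closing quote exists.
FUNCTION pReadQuotedValue$: txt$, startPos
    LOCAL q$, endPos
    FOR i = startPos TO LEN(txt$)-1
        q$ = MID$(txt$, i, 1)
        IF q$ = CHR$(34) OR q$ = CHR$(39)
            // The closing quote must be the same character as the opening one
            endPos = INSTR(txt$, q$, i+1)
            // A real match always lies after the opening quote
            IF endPos <= i THEN RETURN ""
            RETURN MID$(txt$, i+1, endPos-i-1)
        ENDIF
    NEXT
    RETURN ""
ENDFUNCTION
```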

Well, I found either a bug in the INSTR command or in the help documentation. The help says INSTR returns -1 if the substring isn't found; however, it actually returns 0. Considering 0 could be the first character in the string, I'd say it's a bug in the command. Luckily, I can safely assume a single or double quote will never be the first character on a line.
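
A quick way to check which behaviour a given GLBasic build has is to print the INSTR result for a substring that is known to be absent, next to the result for a match at the very start of the string. This is only a sketch for comparing against the help file; no particular output is claimed here.

```glbasic
// "z" does not occur in "abc"; the help file says this should give -1
LOCAL notFound = INSTR("abc", "z")
// "a" is the first character, so a zero-based search should give 0
LOCAL firstChar = INSTR("abc", "a")

PRINT "missing substring: " + notFound, 10, 10
PRINT "first character: " + firstChar, 10, 20
SHOWSCREEN
KEYWAIT
```

If both lines show the same number, "not found" cannot be told apart from a match at position 0 in that build.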

MrTAToad

Have you updated your beta copy? Previously INSTR did have a bug!

I looked at the DBPro code ages ago - didn't know it was you who wrote it!

phaelax

Well, I'm just using the free version of GLB. Right now I have a headache trying to track down one little bug. If you look at a gbap file, the closing tag for GLBASIC is being added as its inner content and I have no clue why. So basically it says the inner content of GLBASIC is "</GLBASIC" and it doesn't do this on any other tags. I'll update the code above to what I have now.

MrTAToad

At the moment, I don't think it can handle ?XML at the beginning of GLBasic XML project files...

phaelax

It should now, and it should now find attributes that use either double or single quotes. And I've fixed the bug I just described above. Like I figured, it was an issue with the DBP/GLB conversion where 0 is the start of a string and not 1. Basically, all I had to do was set matchOpenBracket to -1 instead of 0 and check for that condition when adding the content.
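
The idea behind that fix can be sketched in isolation. pFirstOpenBracket is a hypothetical helper written only to illustrate the sentinel; it is not part of the parser in the first post.

```glbasic
// Hypothetical helper: returns the zero-based position of the first "<"
// in txt$, or -1 when there is none. -1 is used as the "nothing found"
// marker because 0 is a valid position in a zero-based string.
FUNCTION pFirstOpenBracket: txt$
    FOR i = 0 TO LEN(txt$)-1
        IF MID$(txt$, i, 1) = "<" THEN RETURN i
    NEXT
    RETURN -1
ENDFUNCTION
```

For a line that starts with a tag, such as "<GLBASIC>", the helper returns 0. With 0 as the "no bracket" marker, that line would look as if no tag were open, and the characters that follow could be treated as content.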

Everything should work now, I've already updated the code in the first post.  Give it a try with one of your project files.

MrTAToad

Will do!

Will need to fully examine the output, but it certainly looks correct...  Now it just needs to be in an extended TYPE :)

phaelax

According to Wikipedia, there are 5 predefined escape entities.  I've added checks for those within the attribute parser. Additional entities can be easily added.
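
A minimal sketch of unescaping the five predefined entities in a string. pUnescapeXml$ is a hypothetical helper shown for illustration and is not the code in the first post, which loops over the escapes$ table instead.

```glbasic
// Hypothetical helper: replaces the five predefined XML entities in txt$
FUNCTION pUnescapeXml$: txt$
    txt$ = REPLACE$(txt$, "&lt;", "<")
    txt$ = REPLACE$(txt$, "&gt;", ">")
    txt$ = REPLACE$(txt$, "&apos;", "'")
    txt$ = REPLACE$(txt$, "&quot;", CHR$(34))
    // "&amp;" is handled last so that "&amp;lt;" becomes the literal
    // text "&lt;" instead of being unescaped a second time to "<"
    txt$ = REPLACE$(txt$, "&amp;", "&")
    RETURN txt$
ENDFUNCTION
```

Additional entities could be added as further REPLACE$ lines above the "&amp;" line.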

One thing I forgot to consider was comments, which will lock the parser up right now if they are present in an XML file. So that'll be my next task.
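
One possible approach, sketched under the assumption that a comment opens and closes on the same line, is to remove comments from each line before it reaches the tag scanner. pStripXmlComments$ is a hypothetical helper, not part of the posted parser.

```glbasic
// Hypothetical helper: returns txt$ with every "<!-- ... -->" span removed.
// A comment that is still open at the end of the line is dropped from
// "<!--" onward.
FUNCTION pStripXmlComments$: txt$
    LOCAL result$, inComment = FALSE
    LOCAL i = 0
    WHILE i < LEN(txt$)
        IF inComment = FALSE
            IF MID$(txt$, i, 4) = "<!--"
                // Comment starts here: skip the opening marker
                inComment = TRUE
                INC i, 4
            ELSE
                // Ordinary character: keep it
                result$ = result$ + MID$(txt$, i, 1)
                INC i, 1
            ENDIF
        ELSE
            IF MID$(txt$, i, 3) = "-->"
                // Comment ends here: skip the closing marker
                inComment = FALSE
                INC i, 3
            ELSE
                // Character inside the comment: discard it
                INC i, 1
            ENDIF
        ENDIF
    WEND
    RETURN result$
ENDFUNCTION
```

Given "a<!-- note -->b" the helper returns "ab". Comments that span several lines would need the inComment flag kept in a GLOBAL between READLINE calls.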

Kitty Hello

<drumroll>
Can you parse this file:
http://www.glbasic.com/help/glbasic_e.xml

That would be .. awesome!

No hesitating, though.

phaelax

It locks up with the glbasic_e file, I'll need to look at it. (I've been outa town for 2 weeks, hence my silence)

Apparently the lines in the file are too long for GLB.  The first XENTRY tag in the file is over 21k characters long and it's being broken up internally into several READLINE commands. It's technically breaking the single line up into 21 lines, so I'm thinking GLB has a 1k character limit per READLINE.

You can test this yourself with this snippet and the attached file.
Code (glbasic)

OPENFILE(1, "c:/data.txt", TRUE)
LOCAL i = 0
LOCAL L$
WHILE ENDOFFILE(1) = FALSE
READLINE 1, L$
PRINT L$, 1, i*10;INC i
WEND
CLOSEFILE 1
PRINT "Line count: "+i, 1, 20+i*10
SHOWSCREEN
KEYWAIT



I did try reading this same example file into DarkBasic, and it just crashes.

[attachment deleted by admin]

MrTAToad

Yes, I think READLINE is limited to around 1K or so...