Topic: xml parser

Offline phaelax

  • Mc. Print
  • *
  • Posts: 36
« on: 2011-Jan-28 »
I originally made this with DarkBasic and decided to port it over. I was able to make a few changes since GLB lets me use arrays in Types. There is a bug or two. It's adding the closing tag of a root node to the array when it shouldn't. And the function that returns a tag's inner content doesn't work properly when it includes the inner content of all its children. I don't have these issues in the DB version, so I'm guessing it has something to do with GLB indices being zero-based while DB starts at 1. Or maybe I just copied something over wrong. There's a test XML file here: http://zimnox.com/quiz.xml

Any time you call xmlReadFile() you should call xmlClear() first.

xmlReadFile(string)
xmlGetElementCount()
xmlGetTagName$(int)
xmlGetAttributeValue$(int, string)
xmlAttributeExists(int, string)
xmlGetAttributeKey$(int, int)
xmlGetAttributeCount(int)
xmlGetTagContent$(int, bool)
xmlClear()
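
A minimal usage sketch of the calling order described above, assuming the functions from the listing below are part of the same project. The file path and the "ID" attribute are placeholders for illustration only; the parser does not require either.

```glbasic
// Usage sketch: clear old results, parse, then walk the elements.
// "c:/quiz.xml" and the "ID" attribute are placeholders.
LOCAL n, y

xmlClear()
xmlReadFile("c:/quiz.xml")

FOR n = 0 TO xmlGetElementCount()-1
        PRINT xmlGetTagName$(n) + ": " + xmlGetTagContent$(n, FALSE), 10, y
        INC y, 10
        // Attribute keys are stored in upper case by the attribute parser
        IF xmlAttributeExists(n, "ID")
                PRINT "  id = " + xmlGetAttributeValue$(n, "ID"), 10, y
                INC y, 10
        ENDIF
NEXT

SHOWSCREEN
KEYWAIT
```

Tag names and attribute keys come back in upper case because the parser runs them through UCASE$ while reading.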

Code: GLBasic
// --------------------------------- //
// Project: XMLParser
// Author: Phaelax
// Start: Wednesday, January 26, 2011
// IDE Version: 8.078



TYPE AttributeSet
        key$
        value$
ENDTYPE

TYPE ElementObject
        tagName$
        parentElementId
        content$
        pos
        parentPos
        attributes[0] AS AttributeSet
ENDTYPE

GLOBAL escapes$[]
DIM escapes$[5][2]
// "&amp;" goes last so an escaped ampersand is not decoded twice
escapes$[0][0] = "&lt;"  ; escapes$[0][1] = "<"
escapes$[1][0] = "&gt;"  ; escapes$[1][1] = ">"
escapes$[2][0] = "&apos;"; escapes$[2][1] = CHR$(39)
escapes$[3][0] = "&quot;"; escapes$[3][1] = CHR$(34)
escapes$[4][0] = "&amp;" ; escapes$[4][1] = "&"


GLOBAL xmlTags[] AS ElementObject
GLOBAL parseStack[]
DIM xmlTags[0]


//xmlReadFile("c:/quiz.xml")
xmlReadFile("C:/Documents and Settings/Phaelax.NEWTON64/My Documents/GLBasic/zelda/zelda.gbap")
LOCAL key$
LOCAL y = 0
FOR i = 0 TO xmlGetElementCount()-1
        PRINT i+": "+xmlTags[i].tagName$+" -> "+xmlGetTagContent$(i, FALSE), 50, y;INC y, 10
        FOR j = 0 TO xmlGetAttributeCount(i)-1
                key$ = xmlGetAttributeKey$(i, j)
                PRINT key$ + " -> " + xmlGetAttributeValue$(i, key$), 100, y;INC y, 10
        NEXT
NEXT


SHOWSCREEN
KEYWAIT
END









FUNCTION xmlReadFile:filename$
        LOCAL xmlFileNo = 1
        LOCAL L$, tagName$, c$, oldChar$, temp$, unparsedAttributes$
        LOCAL matchOpenBracket, tagType, strLength, currentTag
        // Stop early if the file cannot be opened for reading
        IF OPENFILE(xmlFileNo, filename$, TRUE) = FALSE THEN RETURN FALSE

        WHILE ENDOFFILE(xmlFileNo) = FALSE
                READLINE xmlFileNo, L$
                tagName$ = ""
                matchOpenBracket = -1
                tagType = 0
                strLength = LEN(L$)

                FOR i = 0 TO strLength-1
                        c$ = MID$(L$, i, 1)

                        //////////////////////////////////////////////
                        // open bracket found for new tag
                        //////////////////////////////////////////////
                        IF c$ = "<"
                                matchOpenBracket = i
                                tagType = 0
                        ENDIF

                        //////////////////////////////////////////////
                        // forward slash can either be part of a
                        // closing container tag, or closing an empty
                        //////////////////////////////////////////////
                        IF c$ = "/"
                                //////////////////////////////////////////////
                                // If part of a closing tag, the slash will be
                                // prefixed by the bracket (less-than sign)
                                //////////////////////////////////////////////
                                IF oldChar$ = "<" THEN tagType = 1
                        ENDIF

                        //////////////////////////////////////////////
                        // Closing bracket for a tag
                        //////////////////////////////////////////////
                        IF c$ = ">"
                       
                                //////////////////////////////////////////////
                                // if character before closing bracket was
                                // a slash, then this bracket closed off an
                                // empty tag
                                //////////////////////////////////////////////
                                IF oldChar$ = "/"
                                        tagType = 2
                                ELSE
                                        //////////////////////////////////////////////
                                        // "<? ?>" is part of the XML declaration
                                        //////////////////////////////////////////////
                                        IF oldChar$ = "?"
                                                tagType = 2
                                        ELSE
                                                //////////////////////////////////////////////
                                                // Normal close bracket, standard container
                                                //////////////////////////////////////////////
                                        ENDIF
                                ENDIF
                                //////////////////////////////////////////////
                                // If we closed off (completed) the opening
                                // tag's bracket, then it's open as the current
                                // container. Add this tag to the container stack
                                // for tracking the hierarchy and store a new
                                // tag element in the array
                                //////////////////////////////////////////////
                                IF tagType = 0

                                        LOCAL e AS ElementObject
                                        e.pos = matchOpenBracket
                                        temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-1)
                                        e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
                                        pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
                                        e.content$ = ""
                                        //////////////////////////////////////////////
                                        // A parent ID of -1 means it is the root node
                                        //////////////////////////////////////////////
                                        IF LEN(parseStack[]) <= 0
                                                e.parentElementId = -1
                                        ELSE
                                                // The parent is the element on top of the
                                                // stack, not the stack depth
                                                e.parentElementId = parseStack[LEN(parseStack[])-1]
                                                //////////////////////////////////////////////
                                                // The position within the parent tag's content
                                                // where this tag's data is present
                                                //////////////////////////////////////////////
                                                e.parentPos = LEN(xmlTags[e.parentElementId].content$)
                                        ENDIF
                                        DIMPUSH xmlTags[], e

                                        //////////////////////////////////////////////
                                        // Add the index of the last tag element added
                                        // to the xmlTags array to the stack. This keeps
                                        // track of what container we're in
                                        //////////////////////////////////////////////
                                        DIMPUSH parseStack[], LEN(xmlTags[])-1
                                ENDIF
                                //////////////////////////////////////////////
                                // Closing tag was found, remove last container
                                // from stack
                                //////////////////////////////////////////////
                                IF tagType = 1
                                        DIMDEL parseStack[], -1
                                ENDIF

                                //////////////////////////////////////////////
                                // This was an empty tag element. As they are
                                // not containers, nothing is added to the stack
                                // and nothing needs removed. Create a new
                                // element and add it to the xmlTags array.
                                //////////////////////////////////////////////
                                IF tagType = 2
                                        LOCAL e AS ElementObject
                                       
                                        //////////////////////////////////////////////
                                        // Checks for special case with XML declaration
                                        //////////////////////////////////////////////
                                        IF oldChar$ <> "?"
                                                temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-2)
                                        ELSE
                                                temp$ = MID$(L$, matchOpenBracket+2, i-matchOpenBracket-3)
                                        ENDIF
                                       
                                        e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
                                        pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
                                        e.content$ = ""
                                        IF LEN(parseStack[]) <= 0
                                                e.parentElementId = -1
                                        ELSE
                                                // The parent is the element on top of the
                                                // stack, not the stack depth
                                                e.parentElementId = parseStack[LEN(parseStack[])-1]
                                                //////////////////////////////////////////////
                                                // The position within the parent tag's content
                                                // where this tag's data begins
                                                //////////////////////////////////////////////
                                                e.parentPos = LEN(xmlTags[e.parentElementId].content$)
                                        ENDIF
                                        DIMPUSH xmlTags[], e
                                ENDIF

                                //////////////////////////////////////////////
                                // Start the whole process over again, the
                                // container has been closed.
                                //////////////////////////////////////////////
                                matchOpenBracket = -1
                               
                        ELSE
                                IF matchOpenBracket = -1
                                        LOCAL j = LEN(parseStack[])-1
                                        // Text is only stored while a container is open.
                                        // Element 0 is a valid container in a zero-based array.
                                        IF j >= 0
                                                currentTag = parseStack[j]
                                                IF LEN(xmlTags[currentTag].content$) > 0
                                                        xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
                                                ELSE
                                                        // Skip leading spaces and tabs
                                                        IF ASC(c$) <> 32 AND ASC(c$) <> 9 THEN xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
                                                ENDIF
                                        ENDIF

                                ENDIF
                        ENDIF
                        //////////////////////////////////////////////
                        // Helps keep track of previous characters when
                        // checking for forward slashes, which are used
                        // to determine the type of tag
                        //////////////////////////////////////////////
                        oldChar$ = c$
                NEXT
        WEND

        CLOSEFILE xmlFileNo
ENDFUNCTION



FUNCTION xmlClear:
        REDIM xmlTags[0]
        // Also drop any containers left open by a previous, malformed file
        DIM parseStack[0]
ENDFUNCTION



FUNCTION xmlGetElementCount:
        RETURN LEN(xmlTags[])
ENDFUNCTION



FUNCTION xmlGetTagName$:elementId
        RETURN xmlTags[elementId].tagName$
ENDFUNCTION



FUNCTION xmlGetAttributeValue$:elementId, key$
        // Keys are stored in upper case, so compare in upper case
        key$ = UCASE$(key$)
        FOR j = 0 TO xmlGetAttributeCount(elementId)-1
                IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN xmlTags[elementId].attributes[j].value$
        NEXT
        // Attribute not found
        RETURN ""
ENDFUNCTION



FUNCTION xmlAttributeExists:elementId, key$
        // Keys are stored in upper case, so compare in upper case
        key$ = UCASE$(key$)
        FOR j = 0 TO LEN(xmlTags[elementId].attributes[])-1
                IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN TRUE
        NEXT
        RETURN FALSE
ENDFUNCTION



FUNCTION xmlGetAttributeKey$:elementId, index
        RETURN xmlTags[elementId].attributes[index].key$
ENDFUNCTION



FUNCTION xmlGetAttributeCount:elementId
        RETURN LEN(xmlTags[elementId].attributes[])
ENDFUNCTION



FUNCTION xmlGetTagContent$:elementId, includeChildren
        LOCAL content$ = xmlTags[elementId].content$
        IF includeChildren = TRUE
                LOCAL extendedLength = 0
                FOR i = 0 TO LEN(xmlTags[])-1
                        IF xmlTags[i].parentElementId = elementId
                                content$ = pInsertString$(content$, xmlTags[i].content$, xmlTags[i].parentPos + extendedLength)
                                extendedLength = extendedLength + LEN(xmlTags[i].content$)
                        ENDIF
                NEXT
        ENDIF
        RETURN content$
ENDFUNCTION


FUNCTION pParseXmlAttributes:element AS ElementObject, txt$
        LOCAL s=0, x=0, s1=0, quote=34
        LOCAL key$, value$

        FOR j = 0 TO LEN(txt$)-1
                x = INSTR(txt$, "=", s)
                key$ = UCASE$(TRIM$(MID$(txt$, s, x-s)))
                s = INSTR(txt$, CHR$(34), x)+1
                s1 = INSTR(txt$, CHR$(39), x)+1
               
                quote = 34
                IF s1 > 0
                        IF s1 < s OR s < 1
                                s = s1
                                quote = 39
                        ENDIF
                ENDIF
                x = INSTR(txt$, CHR$(quote), s)
               
                value$ = MID$(txt$, s, x-s)
                FOR k = 0 TO BOUNDS(escapes$[], 0)-1
                        value$ = REPLACE$(value$, escapes$[k][0], escapes$[k][1])
                NEXT
               
                LOCAL a AS AttributeSet
                a.key$ = key$
                a.value$ = value$
                DIMPUSH element.attributes[], a
               
                s = x+1
                j = x
        NEXT
ENDFUNCTION



FUNCTION pFindTagNameEndIndex:tagLine$
        LOCAL L = LEN(tagLine$)
        FOR i = 0 TO L-1
                IF MID$(tagLine$, i, 1) = " " THEN RETURN i
        NEXT
        RETURN L
ENDFUNCTION



FUNCTION pInsertString$:source$, seg$, pos
        LOCAL t$ = LEFT$(source$, pos)
        source$ = t$ + seg$ + RIGHT$(source$, LEN(source$)-LEN(t$))
        RETURN source$
ENDFUNCTION
 
« Last Edit: 2011-Jan-29 by phaelax »

Offline Moru

  • Administrator
  • Prof. Inline
  • *******
  • Posts: 1774
« Reply #1 on: 2011-Jan-28 »
Lots of comments, nice! My xml-parser is not this complete so I will use yours instead :-)

Offline Kitty Hello

  • code monkey
  • Administrator
  • Prof. Inline
  • *******
  • Posts: 10697
  • here on my island the sea says 'hello'
    • http://www.glbasic.com
« Reply #2 on: 2011-Jan-28 »
Can you parse the gbap files (GLBasic project files) with this? That would be... like awesome.

Offline phaelax

« Reply #3 on: 2011-Jan-28 »
Theoretically it should parse the gbap files since they're XML. I just tested it, but it seems I have a bug parsing the attributes for closed tags. I'll work on it some more.

Offline Wampus

  • Prof. Inline
  • *****
  • Posts: 1004
« Reply #4 on: 2011-Jan-28 »
Oh! Keep debugging.  :good:

This is rather awesome. To be able to parse xml in GLBasic would open up some interesting possibilities.

Offline phaelax

« Reply #5 on: 2011-Jan-28 »
I got it to parse everything now, as far as I can tell anyway. I'll post the new code here in a minute which will extract all tags/attributes from the project file.  I just want to make a correction to the xml declaration tag name, which shows "?xml" instead of just "xml".  Also, right now attributes only work with double quotes, not single quotes. I want to fix that too.


Does anyone know if mixing single and double quotes around an attribute value is permitted or do they have to match?
Ex. 
something = "puppy'
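
For reference, XML requires the closing quote of an attribute value to be the same character as the opening one, so the mixed example above is not well-formed. A minimal sketch of reading a value on that basis follows. `pReadQuoted$` is a hypothetical helper, not part of the parser above, and it assumes INSTR returns -1 when nothing is found, as the help file documents.

```glbasic
// Hypothetical helper: read a quoted value starting at the opening quote.
// Returns "" if openPos is not a quote or the matching close is missing.
// Assumes INSTR returns -1 for "not found".
FUNCTION pReadQuoted$: txt$, openPos
        LOCAL q$ = MID$(txt$, openPos, 1)
        IF q$ <> CHR$(34) AND q$ <> CHR$(39) THEN RETURN ""
        // The closing quote must be the same character as the opening one
        LOCAL closePos = INSTR(txt$, q$, openPos+1)
        IF closePos < 0 THEN RETURN ""
        RETURN MID$(txt$, openPos+1, closePos-openPos-1)
ENDFUNCTION
```

Because the search only looks for the opening character, a value such as "it's" keeps its apostrophe instead of ending early.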

Well, I found either a bug in the INSTR command or in the help documentation. The help says INSTR returns -1 if the substring isn't found; however, it actually returns 0. Considering 0 could be the first character in the string, I'd say it's a bug in the command. Luckily, I can safely assume a single or double quote will never be the first character on a line.
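
One way to stay independent of which value a given build returns is to double-check the reported position. This is only a sketch: `pFound` is a made-up helper name, and it assumes nothing about INSTR beyond the two behaviours described above (-1 per the help file, 0 as observed).

```glbasic
// Hypothetical helper: TRUE only if find$ really occurs in text$
// at the position INSTR reports.
FUNCTION pFound: text$, find$, start
        LOCAL p = INSTR(text$, find$, start)
        // Documented "not found" value
        IF p < 0 THEN RETURN FALSE
        // Guards against a build that reports 0 for "not found"
        IF MID$(text$, p, LEN(find$)) = find$ THEN RETURN TRUE
        RETURN FALSE
ENDFUNCTION
```

A call such as pFound(txt$, CHR$(34), 0) could then replace a bare comparison of the INSTR result.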
« Last Edit: 2011-Jan-28 by phaelax »

MrTAToad

  • Guest
« Reply #6 on: 2011-Jan-28 »
Have you updated your beta copy? Previously INSTR did have a bug!

I looked at the DBPro code ages ago - didn't know it was you who wrote it!

Offline phaelax

« Reply #7 on: 2011-Jan-29 »
Well, I'm just using the free version of GLB. Right now I have a headache trying to track down one little bug. If you look at a gbap file, the closing tag for GLBASIC is being added as its inner content and I have no clue why. So basically it says the inner content of GLBASIC is "</GLBASIC", and it doesn't do this on any other tags. I'll update the code above to what I have now.

MrTAToad

« Reply #8 on: 2011-Jan-29 »
At the moment, I don't think it can handle ?XML at the beginning of GLBasic XML project files...

Offline phaelax

« Reply #9 on: 2011-Jan-29 »
It should now, and it should also find attributes that use either double or single quotes. And I've fixed the bug I just described above. Like I figured, it was an issue with the DBP/GLB conversion, where 0 is the start of a string and not 1. Basically, all I had to do was set matchOpenBracket to -1 instead of 0 and check for that condition when adding the content.

Everything should work now, I've already updated the code in the first post.  Give it a try with one of your project files.
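
The sentinel change described above can be sketched in isolation like this. `pCountTextChars` is a hypothetical helper written only to show the idea; it is not part of the parser.

```glbasic
// Sketch of the sentinel idea: -1 means "not inside a tag",
// because 0 is a valid position in a zero-based string.
FUNCTION pCountTextChars: L$
        LOCAL i, c$, count
        LOCAL matchOpenBracket = -1
        FOR i = 0 TO LEN(L$)-1
                c$ = MID$(L$, i, 1)
                IF c$ = "<" THEN matchOpenBracket = i
                // Only characters outside <...> count as content
                IF matchOpenBracket = -1 THEN INC count, 1
                IF c$ = ">" THEN matchOpenBracket = -1
        NEXT
        RETURN count
ENDFUNCTION
```

With 0 as the "no open bracket" value, a tag starting in the first column would be indistinguishable from plain text, which matches the symptom of a closing tag ending up in the content.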

MrTAToad

« Reply #10 on: 2011-Jan-29 »
Will do!

Will need to fully examine the output, but it certainly looks correct...  Now it just needs to be in an extended TYPE :)
« Last Edit: 2011-Jan-29 by MrTAToad »

Offline phaelax

« Reply #11 on: 2011-Jan-29 »
According to Wikipedia, there are 5 predefined escape entities.  I've added checks for those within the attribute parser. Additional entities can be easily added.
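
As a standalone sketch, decoding the five predefined entities could look like the following. `pUnescape$` is a hypothetical name used for illustration; the attribute parser in the first post does the same job with a lookup array.

```glbasic
// Sketch: decode the five predefined XML entities in a string.
// "&amp;" is decoded last so "&amp;lt;" ends up as "&lt;" and not "<".
FUNCTION pUnescape$: txt$
        txt$ = REPLACE$(txt$, "&lt;", "<")
        txt$ = REPLACE$(txt$, "&gt;", ">")
        txt$ = REPLACE$(txt$, "&apos;", CHR$(39))
        txt$ = REPLACE$(txt$, "&quot;", CHR$(34))
        txt$ = REPLACE$(txt$, "&amp;", "&")
        RETURN txt$
ENDFUNCTION
```

The order matters: decoding "&amp;" first would turn an escaped entity into a live one, which a later replacement would then decode a second time.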

One thing I forgot to consider was comments, which will lock it up right now if they appear in an XML file. So that'll be my next task.
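
One possible direction, as a sketch only: recognise a comment from the text between "<" and ">" and skip it before the attribute parser sees it. `pIsCommentStart` is a hypothetical helper, and the approach assumes the comment sits on a single line and contains no ">" of its own, since the reader treats the first ">" as the end of a tag.

```glbasic
// Hypothetical check: does the text between "<" and ">" start a comment?
// tagBody$ is the tag text without the surrounding brackets.
FUNCTION pIsCommentStart: tagBody$
        IF LEFT$(tagBody$, 3) = "!--" THEN RETURN TRUE
        RETURN FALSE
ENDFUNCTION
```

A caller would test temp$ with this before creating an element, and neither push the comment onto the stack nor hand its text to pParseXmlAttributes.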

Offline Kitty Hello

« Reply #12 on: 2011-Feb-07 »
<drumroll>
Can you parse this file:
http://www.glbasic.com/help/glbasic_e.xml

That would be .. awesome!

No hesitating, though.

Offline phaelax

« Reply #13 on: 2011-Feb-15 »
It locks up with the glbasic_e file, I'll need to look at it. (I've been outa town for 2 weeks, hence my silence)

Apparently the lines in the file are too long for GLB.  The first XENTRY tag in the file is over 21k characters long and it's being broken up internally into several READLINE commands. It's technically breaking the single line up into 21 lines, so I'm thinking GLB has a 1k character limit per READLINE.

You can test this yourself with this snippet and the attached file.
Code: GLBasic
OPENFILE(1, "c:/data.txt", TRUE)
LOCAL i = 0
LOCAL L$
WHILE ENDOFFILE(1) = FALSE
        READLINE 1, L$
        PRINT L$, 1, i*10;INC i
WEND
CLOSEFILE 1
PRINT "Line count: "+i, 1, 20+i*10
SHOWSCREEN
KEYWAIT
 


I did try reading this same example file into DarkBasic, and it just crashes.

[attachment deleted by admin]
« Last Edit: 2011-Feb-15 by phaelax »

MrTAToad

« Reply #14 on: 2011-Feb-15 »
Yes, I think READLINE is limited to around 1K or so...