xml parser

phaelax · 2011-Jan-28

I originally wrote this in DarkBasic and decided to port it over. I was able to make a few changes since GLB lets me use arrays in Types. There is a bug or two. It adds the closing tag of a root node to the array when it shouldn't, and the function that returns a tag's inner content doesn't work properly when asked to include the inner content of all its children. I don't have these issues in the DB version, so I'm guessing it has something to do with GLB indices being zero-based while DB starts at 1. Or maybe I just copied something over wrong. There's a test XML file here: http://zimnox.com/quiz.xml

Any time you call xmlReadFile() you should call xmlClear() first.

xmlReadFile(string)
xmlGetElementCount()
xmlGetTagName$(int)
xmlGetAttributeValue$(int, string)
xmlAttributeExists(int, string)
xmlGetAttributeKey$(int, int)
xmlGetAttributeCount(int)
xmlGetTagContent$(int, bool)
xmlClear()
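
A minimal usage sketch of the calls listed above. It assumes the parser code from this post sits after it in the same project, and "quiz.xml" is only a placeholder path:

```glbasic
// Sketch only: reset the parser, load a file, then list every tag name.
// "quiz.xml" is a placeholder path, not a file shipped with the parser.
LOCAL i
LOCAL y = 0

xmlClear()
xmlReadFile("quiz.xml")

FOR i = 0 TO xmlGetElementCount()-1
	PRINT i + ": " + xmlGetTagName$(i), 10, y
	INC y, 10
NEXT

SHOWSCREEN
KEYWAIT
END
```

Calling xmlClear() first matters because xmlReadFile() only appends to the element array.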

Code (glbasic):


// --------------------------------- //
// Project: XMLParser
// Author: Phaelax
// Start: Wednesday, January 26, 2011
// IDE Version: 8.078



TYPE AttributeSet
	key$
	value$
ENDTYPE

TYPE ElementObject
	tagName$
	parentElementId
	content$
	pos
	parentPos
	attributes[0] AS AttributeSet
ENDTYPE

GLOBAL escapes$[]
DIM escapes$[5][2]
// "&amp;" is decoded last so that "&amp;lt;" yields "&lt;" rather than "<"
escapes$[0][0] = "&lt;"  ; escapes$[0][1] = "<"
escapes$[1][0] = "&gt;"  ; escapes$[1][1] = ">"
escapes$[2][0] = "&apos;"; escapes$[2][1] = "'"
escapes$[3][0] = "&quot;"; escapes$[3][1] = CHR$(34)
escapes$[4][0] = "&amp;" ; escapes$[4][1] = "&"


GLOBAL xmlTags[] AS ElementObject
GLOBAL parseStack[]
DIM xmlTags[0]


//xmlReadFile("c:/quiz.xml")
xmlReadFile("C:/Documents and Settings/Phaelax.NEWTON64/My Documents/GLBasic/zelda/zelda.gbap")
LOCAL key$
LOCAL y = 0
FOR i = 0 TO xmlGetElementCount()-1
	PRINT i+": "+xmlTags[i].tagName$+" -> "+xmlGetTagContent$(i, FALSE), 50, y;INC y, 10
	FOR j = 0 TO xmlGetAttributeCount(i)-1
		key$ = xmlGetAttributeKey$(i, j)
		PRINT key$ + " -> " + xmlGetAttributeValue$(i, key$), 100, y;INC y, 10
	NEXT
NEXT


SHOWSCREEN
KEYWAIT
END









FUNCTION xmlReadFile:filename$
	LOCAL xmlFileNo = 1
	LOCAL L$, tagName$, c$, oldChar$, temp$, unparsedAttributes$
	LOCAL matchOpenBracket, tagType, strLength, currentTag
	OPENFILE(xmlFileNo, filename$, TRUE)

	WHILE ENDOFFILE(xmlFileNo) = FALSE
		READLINE xmlFileNo, L$
		tagName$ = ""
		matchOpenBracket = -1
		tagType = 0
		strLength = LEN(L$)

		FOR i = 0 TO strLength-1
			c$ = MID$(L$, i, 1)

			//////////////////////////////////////////////
			// open bracket found for new tag
			//////////////////////////////////////////////
			IF c$ = "<"
				matchOpenBracket = i
				tagType = 0
			ENDIF

			//////////////////////////////////////////////
			// forward slash can either be part of a
			// closing container tag, or closing an empty
			//////////////////////////////////////////////
			IF c$ = "/"
				//////////////////////////////////////////////
				// If part of a closing tag, the slash will be
				// prefixed by the bracket (less-than sign)
				//////////////////////////////////////////////
				IF oldChar$ = "<" THEN tagType = 1
			ENDIF

			//////////////////////////////////////////////
			// Closing bracket for a tag
			//////////////////////////////////////////////
			IF c$ = ">"
			
				//////////////////////////////////////////////
				// if character before closing bracket was
				// a slash, then this bracket closed off an
				// empty tag
				//////////////////////////////////////////////
				IF oldChar$ = "/"
					tagType = 2
				ELSE
					//////////////////////////////////////////////
					// "<? ?>" is part of the XML declaration
					//////////////////////////////////////////////
					IF oldChar$ = "?"
						tagType = 2
					ELSE
						//////////////////////////////////////////////
						// Normal close bracket, standard container
						//////////////////////////////////////////////
					ENDIF
				ENDIF
				//////////////////////////////////////////////
				// If we closed off (completed) the opening
				// tag's bracket, then it's open as the current
				// container. Add this tag to the container stack
				// for tracking the hierarchy and store a new
				// tag element in the array
				//////////////////////////////////////////////
				IF tagType = 0

					LOCAL e AS ElementObject
					e.pos = matchOpenBracket
					temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-1)
					e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
					pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
					e.content$ = ""
					//////////////////////////////////////////////
					// A parent ID of -1 means it is the root node
					//////////////////////////////////////////////
					IF LEN(parseStack[]) <= 0
						e.parentElementId = -1
					ELSE
						// The parent is the element on top of the stack, not the stack depth
						e.parentElementId = parseStack[LEN(parseStack[])-1]
						//////////////////////////////////////////////
						// The position within the parent tag's content
						// where this tag's data is present
						//////////////////////////////////////////////
						e.parentPos = LEN(xmlTags[e.parentElementId].content$)
					ENDIF
					DIMPUSH xmlTags[], e

					//////////////////////////////////////////////
					// Add the index of the last tag element added
					// to the xmlTags array to the stack. This keeps
					// track of what container we're in
					//////////////////////////////////////////////
					DIMPUSH parseStack[], LEN(xmlTags[])-1
				ENDIF
				//////////////////////////////////////////////
				// Closing tag was found, remove last container
				// from stack
				//////////////////////////////////////////////
				IF tagType = 1
					DIMDEL parseStack[], -1
				ENDIF

				//////////////////////////////////////////////
				// This was an empty tag element. As they are
				// not containers, nothing is added to the stack
				// and nothing needs removed. Create a new
				// element and add it to the xmlTags array.
				//////////////////////////////////////////////
				IF tagType = 2
					LOCAL e AS ElementObject
					
					//////////////////////////////////////////////
					// Checks for special case with XML declaration
					//////////////////////////////////////////////
					IF oldChar$ <> "?"
						temp$ = MID$(L$, matchOpenBracket+1, i-matchOpenBracket-2)
					ELSE
						temp$ = MID$(L$, matchOpenBracket+2, i-matchOpenBracket-3)
					ENDIF
					
					e.tagName$ = TRIM$(UCASE$(LEFT$(temp$, pFindTagNameEndIndex(temp$))))
					pParseXmlAttributes(e, TRIM$(RIGHT$(temp$, LEN(temp$)-LEN(e.tagName$))))
					e.content$ = ""
					IF LEN(parseStack[]) <= 0
						e.parentElementId = -1
					ELSE
						// The parent is the element on top of the stack, not the stack depth
						e.parentElementId = parseStack[LEN(parseStack[])-1]
						//////////////////////////////////////////////
						// The position within the parent tag's content
						// where this tag's data begins
						//////////////////////////////////////////////
						e.parentPos = LEN(xmlTags[e.parentElementId].content$)
					ENDIF
					DIMPUSH xmlTags[], e
				ENDIF

				//////////////////////////////////////////////
				// Start the whole process over again, the
				// container has been closed.
				//////////////////////////////////////////////
				matchOpenBracket = -1
				
			ELSE
				IF matchOpenBracket = -1
					LOCAL j = LEN(parseStack[])-1
					// -1 means "no open container"; 0 is a valid element index
					currentTag = -1
					IF j >= 0 THEN currentTag = parseStack[j]
					IF currentTag >= 0 AND currentTag < LEN(xmlTags[])
						IF LEN(xmlTags[currentTag].content$) > 0
							xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
						ELSE
							IF ASC(c$) <> 32 AND ASC(c$) <> 9 THEN xmlTags[currentTag].content$ = xmlTags[currentTag].content$ + c$
						ENDIF
					ENDIF
					
				ENDIF
			ENDIF
			//////////////////////////////////////////////
			// Helps keep track of previous characters when
			// checking for forward slashes, which are used
			// to determine the type of tag
			//////////////////////////////////////////////
			oldChar$ = c$
		NEXT
	WEND

	CLOSEFILE xmlFileNo
ENDFUNCTION



FUNCTION xmlClear:
	// Reset the element list and the container stack
	REDIM xmlTags[0]
	REDIM parseStack[0]
ENDFUNCTION



FUNCTION xmlGetElementCount:
	RETURN LEN(xmlTags[])
ENDFUNCTION



FUNCTION xmlGetTagName$:elementId
	RETURN xmlTags[elementId].tagName$
ENDFUNCTION



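FUNCTION xmlGetAttributeValue$:elementId, key$
	// Keys are stored upper-case by pParseXmlAttributes, so normalize the lookup key
	key$ = UCASE$(key$)
	FOR j = 0 TO xmlGetAttributeCount(elementId)-1
		IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN xmlTags[elementId].attributes[j].value$
	NEXT
	// Attribute not found
	RETURN ""
ENDFUNCTION



FUNCTION xmlAttributeExists:elementId, key$
	// Keys are stored upper-case by pParseXmlAttributes, so normalize the lookup key
	key$ = UCASE$(key$)
	FOR j = 0 TO LEN(xmlTags[elementId].attributes[])-1
		IF xmlTags[elementId].attributes[j].key$ = key$ THEN RETURN TRUE
	NEXT
	RETURN FALSE
ENDFUNCTION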



FUNCTION xmlGetAttributeKey$:elementId, index
	RETURN xmlTags[elementId].attributes[index].key$
ENDFUNCTION



FUNCTION xmlGetAttributeCount:elementId
	RETURN LEN(xmlTags[elementId].attributes[])
ENDFUNCTION



FUNCTION xmlGetTagContent$:elementId, includeChildren
	LOCAL content$ = xmlTags[elementId].content$
	IF includeChildren = TRUE
		LOCAL extendedLength = 0
		FOR i = 0 TO LEN(xmlTags[])-1
			IF xmlTags[i].parentElementId = elementId
				content$ = pInsertString$(content$, xmlTags[i].content$, xmlTags[i].parentPos + extendedLength)
				extendedLength = extendedLength + LEN(xmlTags[i].content$)
			ENDIF
		NEXT
	ENDIF
	RETURN content$
ENDFUNCTION


FUNCTION pParseXmlAttributes:element AS ElementObject, txt$
	LOCAL s=0, x=0, s1=0, quote=34
	LOCAL key$, value$

	FOR j = 0 TO LEN(txt$)-1
		x = INSTR(txt$, "=", s)
		key$ = UCASE$(TRIM$(MID$(txt$, s, x-s)))
		s = INSTR(txt$, CHR$(34), x)+1
		s1 = INSTR(txt$, CHR$(39), x)+1
		
		quote = 34
		IF s1 > 0
			IF s1 < s OR s < 1
				s = s1
				quote = 39
			ENDIF
		ENDIF
		x = INSTR(txt$, CHR$(quote), s)
		
		value$ = MID$(txt$, s, x-s)
		FOR k = 0 TO BOUNDS(escapes$[], 0)-1
			value$ = REPLACE$(value$, escapes$[k][0], escapes$[k][1])
		NEXT
		
		LOCAL a AS AttributeSet
		a.key$ = key$
		a.value$ = value$
		DIMPUSH element.attributes[], a
		
		s = x+1
		j = x
	NEXT
ENDFUNCTION



FUNCTION pFindTagNameEndIndex:tagLine$
	LOCAL L = LEN(tagLine$)
	FOR i = 0 TO L-1
		IF MID$(tagLine$, i, 1) = " " THEN RETURN i
	NEXT
	RETURN L
ENDFUNCTION



FUNCTION pInsertString$:source$, seg$, pos
	LOCAL t$ = LEFT$(source$, pos)
	source$ = t$ + seg$ + RIGHT$(source$, LEN(source$)-LEN(t$))
	RETURN source$
ENDFUNCTION

Moru · 2011-Jan-28

Lots of comments, nice! My xml-parser is not this complete so I will use yours instead :-)

Kitty Hello · 2011-Jan-28

Can you parse the gbap files (GLBasic project files) with this? That would be... like awesome.

phaelax · 2011-Jan-28

Theoretically it should parse the gbap files since they're XML. I just tested it, but it seems I have a bug parsing the attributes for closed tags. I'll work on it some more.

Wampus · 2011-Jan-28

Oh! Keep debugging.

This is rather awesome. To be able to parse xml in GLBasic would open up some interesting possibilities.

phaelax · 2011-Jan-28

I got it to parse everything now, as far as I can tell anyway. I'll post the new code here in a minute which will extract all tags/attributes from the project file. I just want to make a correction to the xml declaration tag name, which shows "?xml" instead of just "xml". Also, right now attributes only work with double quotes, not single quotes. I want to fix that too.

Does anyone know if mixing single and double quotes around an attribute value is permitted or do they have to match?
Ex.
something = "puppy'

Well, I found either a bug in the INSTR command or in the help documentation. The help says INSTR returns -1 if the substring isn't found; however, it actually returns 0. Considering 0 could be the position of the first character in the string, I'd say it's a bug in the command. Luckily, I can safely assume a single or double quote will never be the first character on a line.
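
A quick way to see what INSTR does on a given build is a check like the one below. It is only a sketch: it prints the result for a substring that is missing and for one that sits at position 0, so the two cases can be compared.

```glbasic
// Sketch: compare INSTR results for "not found" and "found at position 0".
// If both lines print the same number, the two cases cannot be told apart.
LOCAL missing = INSTR("abc", "z", 0)
LOCAL atStart = INSTR("abc", "a", 0)

PRINT "missing substring: " + missing, 10, 10
PRINT "match at start:    " + atStart, 10, 20

SHOWSCREEN
KEYWAIT
END
```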

MrTAToad · 2011-Jan-28

Have you updated your beta copy? Previously INSTR did have a bug!

I looked at the DBPro code ages ago - didn't know it was you who wrote it!

phaelax · 2011-Jan-29

Well, I'm just using the free version of GLB. Right now I have a headache trying to track down one little bug. If you look at a gbap file, the closing tag for GLBASIC is being added as its inner content and I have no clue why. So basically it says the inner content of GLBASIC is "</GLBASIC", and it doesn't do this on any other tags. I'll update the code above to what I have now.

MrTAToad · 2011-Jan-29

At the moment, I don't think it can handle ?XML at the beginning of GLBasic XML project files...

phaelax · 2011-Jan-29

It should now, and it should also find attributes that use either double or single quotes. And I've fixed the bug I just described above. Like I figured, it was an issue with the DBP/GLB conversion, where 0 is the start of a string and not 1. Basically, all I had to do was set matchOpenBracket to -1 instead of 0 and check for that condition when adding the content.
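
The idea behind that fix can be shown in isolation. The snippet below is a stripped-down sketch, not the parser itself: openPos and outside$ are made-up names, and it only collects the characters that sit outside any tag.

```glbasic
// Sketch: in a zero-based string, 0 is a real position, so "no open bracket"
// needs its own sentinel value. Here that sentinel is -1.
LOCAL L$ = "<a>hi</a>"
LOCAL openPos = -1      // -1 = not inside a tag
LOCAL outside$ = ""     // characters found outside of tags
LOCAL c$
LOCAL i

FOR i = 0 TO LEN(L$)-1
	c$ = MID$(L$, i, 1)
	IF c$ = "<"
		openPos = i     // may legitimately be 0
	ELSE
		IF c$ = ">"
			openPos = -1
		ELSE
			IF openPos = -1 THEN outside$ = outside$ + c$
		ENDIF
	ENDIF
NEXT

PRINT outside$, 10, 10   // collects "hi" for the sample line
SHOWSCREEN
KEYWAIT
END
```

With 0 as the sentinel, the "<" at position 0 would look the same as "no open bracket", and the tag name would be collected as content.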

Everything should work now, I've already updated the code in the first post. Give it a try with one of your project files.

MrTAToad · 2011-Jan-29

Will do!

Will need to fully examine the output, but it certainly looks correct... Now it just needs to be in an extended TYPE

phaelax · 2011-Jan-29

According to Wikipedia, there are 5 predefined escape entities. I've added checks for those within the attribute parser. Additional entities can be easily added.
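
As a rough sketch of that decoding step on its own: xmlDecodeEntities$ below is a hypothetical helper, not part of the posted parser. It decodes "&amp;" last so that an escaped ampersand in front of another entity name is not decoded twice.

```glbasic
// Sketch: decode the five predefined XML entities in a string.
PRINT xmlDecodeEntities$("Tom &amp; Jerry &lt;3"), 10, 10   // Tom & Jerry <3
PRINT xmlDecodeEntities$("&amp;lt;"), 10, 20                // &lt;
SHOWSCREEN
KEYWAIT
END

// Hypothetical helper, shown for illustration only
FUNCTION xmlDecodeEntities$: txt$
	txt$ = REPLACE$(txt$, "&lt;", "<")
	txt$ = REPLACE$(txt$, "&gt;", ">")
	txt$ = REPLACE$(txt$, "&apos;", "'")
	txt$ = REPLACE$(txt$, "&quot;", CHR$(34))
	// "&amp;" goes last so "&amp;lt;" ends up as "&lt;" and not "<"
	txt$ = REPLACE$(txt$, "&amp;", "&")
	RETURN txt$
ENDFUNCTION
```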

One thing I forgot to consider was comments, which will lock up the parser right now if they appear in an XML file. So that'll be my next task.

Kitty Hello · 2011-Feb-07

<drumroll>
Can you parse this file:
http://www.glbasic.com/help/glbasic_e.xml

That would be .. awesome!

No hesitating, though.

phaelax · 2011-Feb-15

It locks up with the glbasic_e file, I'll need to look at it. (I've been outa town for 2 weeks, hence my silence)

Apparently the lines in the file are too long for GLB. The first XENTRY tag in the file is over 21k characters long and it's being broken up internally into several READLINE commands. It's technically breaking the single line up into 21 lines, so I'm thinking GLB has a 1k character limit per READLINE.

You can test this yourself with this snippet and the attached file.

Code (glbasic):


OPENFILE(1, "c:/data.txt", TRUE)
LOCAL i = 0
LOCAL L$
WHILE ENDOFFILE(1) = FALSE
	READLINE 1, L$
	PRINT L$, 1, i*10;INC i
WEND
CLOSEFILE 1
PRINT "Line count: "+i, 1, 20+i*10
SHOWSCREEN
KEYWAIT

I did try reading this same example file into DarkBasic, and it just crashes.

[attachment deleted by admin]

MrTAToad · 2011-Feb-15

Yes, I think READLINE is limited to around 1K or so...
