TrueSplitStr - und ein Wort über Performance

Quentin · 2008-Oct-10

inspiriert durch die Frage von Stu_C nach der Arbeitsweise von SPLITSTR

http://www.glbasic.com/forum/index.php?topic=2384.0

habe ich mich an eine modifizierte Version von SPLITSTR gewagt, die doppelte Trennzeichen nicht ignoriert, sondern auch in den Splitprozess miteinbezieht. Wie AndyH richtig bemerkte, kann das z.B. für die Verarbeitung von CSV-Dateien sinnvoll sein.

Eine erste Version von TrueSplitStr, wie ich die Funktion anmaßenderweise mal genannt habe, sieht folgendermaßen aus:

Code (glbasic) Select


FUNCTION TrueSplitStr: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	
	l_len_text = LEN(p_text$)
	
	DIM p_split$[0]                           // initialize p_spit$[]
	
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			DIMPUSH p_split$[], l_item$       // yes then continue with the next word
			l_item$ = ""
		ENDIF
	NEXT
	DIMPUSH p_split$[], l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION

Die Bedenken zur Performance, die im obigen Thread auch genannt wurden, sind nicht ganz unberechtigt. Diese Version von SplitStr benötigt etwa 2-3 mal so viel Zeit wie das Original. Als leichte Bremse erweist sich der DIMPUSH-Befehl, der je nach Anzahl der Array-Elemente immer mal wieder neuen Speicher reservieren und dabei auch die bisherigen Array-Elemente erhalten (sprich umschaufeln) muss.

Wenn man jedoch im Vorfeld die benötigte Anzahl der Arrayelemente schon berechnet, kann man das Array mit einem einzigen DIM-Befehl erzeugen und dann innerhalb der FOR-NEXT-Schleife füllen.

Das Coding ist zwar etwas größer, aber auch deutlich schneller als die erste Version:

Code (glbasic) Select


FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	LOCAL l_count                             // number of entries needed for p_split$[]
	LOCAL l_idx                               // current index of p_split$[]
	
	l_len_text = LEN(p_text$)
	
	// first get the needed number of entries for p_split$[]
	FOR l_ctext = 0 TO l_len_text - 1
		IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
			INC l_count, 1
		ENDIF
	NEXT
	REDIM p_split$[l_count + 1]
	
	// now do the split
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			p_split$[l_idx] = l_item$         // yes then continue with the next word
			l_item$ = ""
			INC l_idx, 1
		ENDIF
	NEXT
	p_split$[l_idx] = l_item$ 

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION

Hier wird zunächst die erforderliche Größe des Arrays errechnet und anschließend das erzeugte Array gefüllt.

Das folgende Coding stellt alle drei Versionen von SplitStr, also

- SPLITSTR - Befehl von GLBasic
- langsame Version von TrueSplitStr
- schnelle Version von TrueSplitStr

gegenüber und gibt die verbrauchten Zeiten an. Als Beispiel habe ich eine CSV-Datei beigefügt, die aus 1000 Zeilen mal 10 Spalten besteht. Die hier auskommentierte Funktion debugshow kann bei aktiviertem Debug-Modus dazu genutzt werden, die Ergebnisse im Debug-Fenster auszugeben.

Die Ergebnisse bei mir waren in etwa so, daß die langsame Variante von TrueSplitStr etwa 2-3 mal langsamer war als der SPLITSTR-Befehl, die schnellere Variante etwa 50% langsamer.

Code (glbasic) Select




LOCAL text$[]
LOCAL split$[]
LOCAL line$

// read a sample csv file
OPENFILE(1, "example.txt", TRUE)
WHILE ENDOFFILE(1) = FALSE
	READLINE 1, line$
	DIMPUSH text$[], line$
WEND

LOCAL time_start, time_end
LOCAL time_splitstr, time_splitstrslow, time_splitstrfast
LOCAL i, total, count

total = LEN(text$[])

// first test with original SPLITSTR command
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = SPLITSTR(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstr = time_end - time_start

// now the "slow" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = TrueSplitStr_Slow(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrslow = time_end - time_start

// and finally the "fast" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = TrueSplitStr_Fast(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrfast = time_end - time_start

PRINT "Original SPLITSTR command     :" + time_splitstr, 0, 0
PRINT "Slow TrueSplitStr function    :" + time_splitstrslow, 0, 30
PRINT "Fast TrueSplitStr function    :" + time_splitstrfast, 0, 60
SHOWSCREEN
KEYWAIT


FUNCTION TrueSplitStr_Slow: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	
	l_len_text = LEN(p_text$)
	
	DIM p_split$[0]                           // initialize p_spit$[]
	
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			DIMPUSH p_split$[], l_item$       // yes then continue with the next word
			l_item$ = ""
		ENDIF
	NEXT
	DIMPUSH p_split$[], l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION


FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	LOCAL l_count                             // number of entries needed for p_split$[]
	LOCAL l_idx                               // current index of p_split$[]
	
	l_len_text = LEN(p_text$)
	
	// first get the needed number of entries for p_split$[]
	FOR l_ctext = 0 TO l_len_text - 1
		IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
			INC l_count, 1
		ENDIF
	NEXT
	REDIM p_split$[l_count + 1]
	
	// now do the split
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			p_split$[l_idx] = l_item$         // yes then continue with the next word
			l_item$ = ""
			INC l_idx, 1
		ENDIF
	NEXT
	p_split$[l_idx] = l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION


FUNCTION debugshow: p_split$[]
	LOCAL i
	FOR i = 0 TO LEN(p_split$[]) - 1
		DEBUG p_split$[i] + "\t"
	NEXT
	DEBUG "\n"
ENDFUNCTION

[attachment deleted by admin]

Schranz0r · 2008-Oct-11

Gernot hat mal eine Bankfunktion gemacht...
Ob man die da gebrauchen kann, ums evtl. schneller zu machen?!

Quentin · 2008-Oct-11

glaube ich kaum Schranzor, die schnellere Variante von TrueSplitStr benötigt für die 1000 Zeilen auf meinem Rechner gerade mal 46 ms, SPLITSTR original so um die 33 ms. Das ist jetzt wirklich nichts, was furchtbar optimierungsbedürftig wäre, denke ich.

Schranz0r · 2008-Oct-12

War nur ne Idee

News:

TrueSplitStr - und ein Wort über Performance

Quentin

Schranz0r

Quentin

Schranz0r