Print Page - TrueSplitStr - and a word about performance

Title: TrueSplitStr - and a word about performance
Post by: Quentin on 2008-Oct-10

inspired by the question of Stu_C about the operation method of SPLITSTR

http://www.glbasic.com/forum/index.php?topic=2384.0

I've tried to implement a modified version of SPLITSTR, which will not ignore multiple splitter without any data between them. As AndyH noted, this could be useful for processing csv files.

The first version of TrueSplitStr, that's how I called it ;), looks like follows:

Code (glbasic) Select


FUNCTION TrueSplitStr: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	
	l_len_text = LEN(p_text$)
	
	DIM p_split$[0]                           // initialize p_spit$[]
	
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			DIMPUSH p_split$[], l_item$       // yes then continue with the next word
			l_item$ = ""
		ENDIF
	NEXT
	DIMPUSH p_split$[], l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION

The doubts about performance mentioned in the thread named above, are not unjustified. This version of SplitStr needs 2-3 times more runtime than the original command. One reason is the DIMPUSH command. Depending on the count of needed array elements it must allocate new memory and must preserve the existing array content.

But if I can calculate the number of needed array elements, I'm able to create the array with only one DIM command and fill the array within the FOR-NEXT-loop.

This coding is bigger but much faster than the first version:

Code (glbasic) Select


FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	LOCAL l_count                             // number of entries needed for p_split$[]
	LOCAL l_idx                               // current index of p_split$[]
	
	l_len_text = LEN(p_text$)
	
	// first get the needed number of entries for p_split$[]
	FOR l_ctext = 0 TO l_len_text - 1
		IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
			INC l_count, 1
		ENDIF
	NEXT
	REDIM p_split$[l_count + 1]
	
	// now do the split
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			p_split$[l_idx] = l_item$         // yes then continue with the next word
			l_item$ = ""
			INC l_idx, 1
		ENDIF
	NEXT
	p_split$[l_idx] = l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION

First the size of the array is calculated. Then it's filled.

The following coding shows all three versions of SplitStr

- SPLITSTR command of GLBasic
- slow version of TrueSplitStr
- fast version of TrueSplitStr

and shows the needed time. As an example I've attached a csv file (example.txt), which is composed by 1000 rows and 10 columns. The commented function debugshow shows the results in the debug window (activate debug mode first).

The results on my computer were: The slow variant fo TrueSplitStr was 2-3 times slower than the SPLITSTR command, the fast variant about 50% slower.

Code (glbasic) Select


LOCAL text$[]
LOCAL split$[]
LOCAL line$

// read a sample csv file
OPENFILE(1, "example.csv", TRUE)
WHILE ENDOFFILE(1) = FALSE
	READLINE 1, line$
	DIMPUSH text$[], line$
WEND

LOCAL time_start, time_end
LOCAL time_splitstr, time_splitstrslow, time_splitstrfast
LOCAL i, total, count

total = LEN(text$[])

// first test with original SPLITSTR command
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = SPLITSTR(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstr = time_end - time_start

// now the "slow" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = TrueSplitStr_Slow(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrslow = time_end - time_start

// and finally the "fast" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
	count = TrueSplitStr_Fast(text$[i], split$[], ",")
	//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrfast = time_end - time_start

PRINT "Original SPLITSTR command     :" + time_splitstr, 0, 0
PRINT "Slow TrueSplitStr function    :" + time_splitstrslow, 0, 30
PRINT "Fast TrueSplitStr function    :" + time_splitstrfast, 0, 60
SHOWSCREEN
KEYWAIT


FUNCTION TrueSplitStr_Slow: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	
	l_len_text = LEN(p_text$)
	
	DIM p_split$[0]                           // initialize p_spit$[]
	
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			DIMPUSH p_split$[], l_item$       // yes then continue with the next word
			l_item$ = ""
		ENDIF
	NEXT
	DIMPUSH p_split$[], l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION


FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

	LOCAL l_len_text                          // length of text$
	LOCAL l_ctext                             // loop counter
	LOCAL l_char$                             // one character of text$
	LOCAL l_item$                             // to add to p_split$[]
	LOCAL l_count                             // number of entries needed for p_split$[]
	LOCAL l_idx                               // current index of p_split$[]
	
	l_len_text = LEN(p_text$)
	
	// first get the needed number of entries for p_split$[]
	FOR l_ctext = 0 TO l_len_text - 1
		IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
			INC l_count, 1
		ENDIF
	NEXT
	REDIM p_split$[l_count + 1]
	
	// now do the split
	FOR l_ctext = 0 TO l_len_text - 1
		l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
		IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
			l_item$ = l_item$ + l_char$       // no, so add it to the next item
		ELSE
			p_split$[l_idx] = l_item$         // yes then continue with the next word
			l_item$ = ""
			INC l_idx, 1
		ENDIF
	NEXT
	p_split$[l_idx] = l_item$

	RETURN LEN(p_split$[])                    // return the number of strings 
	
ENDFUNCTION


FUNCTION debugshow: p_split$[]
	LOCAL i
	FOR i = 0 TO LEN(p_split$[]) - 1
		DEBUG p_split$[i] + "\t"
	NEXT
	DEBUG "\n"
ENDFUNCTION

[attachment deleted by admin]

Title: Re: TrueSplitStr - and a word about performance
Post by: Kitty Hello on 2008-Oct-11

bottleneck is the MID$. If you want it fast, do inline with const char*'s.
Nice finally someone noticed the speed of SLPITSTR - was a lot of work ;)

Title: Re: TrueSplitStr - and a word about performance
Post by: Quentin on 2008-Oct-11

yes I tried to do it with INLINE ... and it worked after several hours of testing ;), but speed was not much better than the pure GLBasic version, so I discard this solution.

The results on my pc for 1000 lines csv file:
- SPLITSTR: 33 ms
- TrueSplitStr (slow): 66 ms
- TrueSplitStr (fast): 46 ms

not bad I think ;)

Normally a function like SPLITSTR is not used in a loop but in front of it for initialization works, so the speed should be sufficient anyway.

GLBasic forum

Codesnippets => Code Snippets => Topic started by: Quentin on 2008-Oct-10