TrueSplitStr - and a word about performance

Previous topic - Next topic

Quentin

inspired by the question of Stu_C about the operation method of SPLITSTR

http://www.glbasic.com/forum/index.php?topic=2384.0

I've tried to implement a modified version of SPLITSTR, which will not ignore multiple splitter without any data between them. As AndyH noted, this could be useful for processing csv files.

The first version of TrueSplitStr, that's how I called it ;), looks like follows:

Code (glbasic) Select

FUNCTION TrueSplitStr: p_text$, p_split$[], p_delim$

LOCAL l_len_text                          // length of text$
LOCAL l_ctext                             // loop counter
LOCAL l_char$                             // one character of text$
LOCAL l_item$                             // to add to p_split$[]

l_len_text = LEN(p_text$)

DIM p_split$[0]                           // initialize p_spit$[]

FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
l_item$ = l_item$ + l_char$       // no, so add it to the next item
ELSE
DIMPUSH p_split$[], l_item$       // yes then continue with the next word
l_item$ = ""
ENDIF
NEXT
DIMPUSH p_split$[], l_item$

RETURN LEN(p_split$[])                    // return the number of strings

ENDFUNCTION


The doubts about performance mentioned in the thread named above, are not unjustified. This version of SplitStr needs 2-3 times more runtime than the original command. One reason is the DIMPUSH command. Depending on the count of needed array elements it must allocate new memory and must preserve the existing array content.

But if I can calculate the number of needed array elements, I'm able to create the array with only one DIM command and fill the array within the FOR-NEXT-loop.

This coding is bigger but much faster than the first version:

Code (glbasic) Select

FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

LOCAL l_len_text                          // length of text$
LOCAL l_ctext                             // loop counter
LOCAL l_char$                             // one character of text$
LOCAL l_item$                             // to add to p_split$[]
LOCAL l_count                             // number of entries needed for p_split$[]
LOCAL l_idx                               // current index of p_split$[]

l_len_text = LEN(p_text$)

// first get the needed number of entries for p_split$[]
FOR l_ctext = 0 TO l_len_text - 1
IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
INC l_count, 1
ENDIF
NEXT
REDIM p_split$[l_count + 1]

// now do the split
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
l_item$ = l_item$ + l_char$       // no, so add it to the next item
ELSE
p_split$[l_idx] = l_item$         // yes then continue with the next word
l_item$ = ""
INC l_idx, 1
ENDIF
NEXT
p_split$[l_idx] = l_item$

RETURN LEN(p_split$[])                    // return the number of strings

ENDFUNCTION


First the size of the array is calculated. Then it's filled.

The following coding shows all three versions of SplitStr

- SPLITSTR command of GLBasic
- slow version of TrueSplitStr
- fast version of TrueSplitStr

and shows the needed time. As an example I've attached a csv file (example.txt), which is composed by 1000 rows and 10 columns. The commented function debugshow shows the results in the debug window (activate debug mode first).

The results on my computer were: The slow variant fo TrueSplitStr was 2-3 times slower than the SPLITSTR command, the fast variant about 50% slower.

Code (glbasic) Select

LOCAL text$[]
LOCAL split$[]
LOCAL line$

// read a sample csv file
OPENFILE(1, "example.csv", TRUE)
WHILE ENDOFFILE(1) = FALSE
READLINE 1, line$
DIMPUSH text$[], line$
WEND

LOCAL time_start, time_end
LOCAL time_splitstr, time_splitstrslow, time_splitstrfast
LOCAL i, total, count

total = LEN(text$[])

// first test with original SPLITSTR command
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = SPLITSTR(text$[i], split$[], ",")
//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstr = time_end - time_start

// now the "slow" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = TrueSplitStr_Slow(text$[i], split$[], ",")
//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrslow = time_end - time_start

// and finally the "fast" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = TrueSplitStr_Fast(text$[i], split$[], ",")
//debugshow(split$[])                     // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrfast = time_end - time_start

PRINT "Original SPLITSTR command     :" + time_splitstr, 0, 0
PRINT "Slow TrueSplitStr function    :" + time_splitstrslow, 0, 30
PRINT "Fast TrueSplitStr function    :" + time_splitstrfast, 0, 60
SHOWSCREEN
KEYWAIT


FUNCTION TrueSplitStr_Slow: p_text$, p_split$[], p_delim$

LOCAL l_len_text                          // length of text$
LOCAL l_ctext                             // loop counter
LOCAL l_char$                             // one character of text$
LOCAL l_item$                             // to add to p_split$[]

l_len_text = LEN(p_text$)

DIM p_split$[0]                           // initialize p_spit$[]

FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
l_item$ = l_item$ + l_char$       // no, so add it to the next item
ELSE
DIMPUSH p_split$[], l_item$       // yes then continue with the next word
l_item$ = ""
ENDIF
NEXT
DIMPUSH p_split$[], l_item$

RETURN LEN(p_split$[])                    // return the number of strings

ENDFUNCTION


FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$

LOCAL l_len_text                          // length of text$
LOCAL l_ctext                             // loop counter
LOCAL l_char$                             // one character of text$
LOCAL l_item$                             // to add to p_split$[]
LOCAL l_count                             // number of entries needed for p_split$[]
LOCAL l_idx                               // current index of p_split$[]

l_len_text = LEN(p_text$)

// first get the needed number of entries for p_split$[]
FOR l_ctext = 0 TO l_len_text - 1
IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
INC l_count, 1
ENDIF
NEXT
REDIM p_split$[l_count + 1]

// now do the split
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1)   // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1      // is it a delimiter?
l_item$ = l_item$ + l_char$       // no, so add it to the next item
ELSE
p_split$[l_idx] = l_item$         // yes then continue with the next word
l_item$ = ""
INC l_idx, 1
ENDIF
NEXT
p_split$[l_idx] = l_item$

RETURN LEN(p_split$[])                    // return the number of strings

ENDFUNCTION


FUNCTION debugshow: p_split$[]
LOCAL i
FOR i = 0 TO LEN(p_split$[]) - 1
DEBUG p_split$[i] + "\t"
NEXT
DEBUG "\n"
ENDFUNCTION



[attachment deleted by admin]

Kitty Hello

bottleneck is the MID$. If you want it fast, do inline with const char*'s.
Nice finally someone noticed the speed of SLPITSTR - was a lot of work ;)

Quentin

yes I tried to do it with INLINE ... and it worked after several hours of testing ;), but speed was not much better than the pure GLBasic version, so I discard this solution.

The results on my pc for 1000 lines csv file:
- SPLITSTR:  33 ms
- TrueSplitStr (slow): 66 ms
- TrueSplitStr (fast): 46 ms

not bad I think ;)

Normally a function like SPLITSTR is not used in a loop but in front of it for initialization works, so the speed should be sufficient anyway.