inspired by the question of Stu_C about the operation method of SPLITSTR
http://www.glbasic.com/forum/index.php?topic=2384.0
I've tried to implement a modified version of SPLITSTR, which will not ignore multiple splitter without any data between them. As AndyH noted, this could be useful for processing csv files.
The first version of TrueSplitStr, that's how I called it ;), looks like follows:
FUNCTION TrueSplitStr: p_text$, p_split$[], p_delim$
LOCAL l_len_text // length of text$
LOCAL l_ctext // loop counter
LOCAL l_char$ // one character of text$
LOCAL l_item$ // to add to p_split$[]
l_len_text = LEN(p_text$)
DIM p_split$[0] // initialize p_spit$[]
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1) // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1 // is it a delimiter?
l_item$ = l_item$ + l_char$ // no, so add it to the next item
ELSE
DIMPUSH p_split$[], l_item$ // yes then continue with the next word
l_item$ = ""
ENDIF
NEXT
DIMPUSH p_split$[], l_item$
RETURN LEN(p_split$[]) // return the number of strings
ENDFUNCTION
The doubts about performance mentioned in the thread named above, are not unjustified. This version of SplitStr needs 2-3 times more runtime than the original command. One reason is the DIMPUSH command. Depending on the count of needed array elements it must allocate new memory and must preserve the existing array content.
But if I can calculate the number of needed array elements, I'm able to create the array with only one DIM command and fill the array within the FOR-NEXT-loop.
This coding is bigger but much faster than the first version:
FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$
LOCAL l_len_text // length of text$
LOCAL l_ctext // loop counter
LOCAL l_char$ // one character of text$
LOCAL l_item$ // to add to p_split$[]
LOCAL l_count // number of entries needed for p_split$[]
LOCAL l_idx // current index of p_split$[]
l_len_text = LEN(p_text$)
// first get the needed number of entries for p_split$[]
FOR l_ctext = 0 TO l_len_text - 1
IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
INC l_count, 1
ENDIF
NEXT
REDIM p_split$[l_count + 1]
// now do the split
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1) // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1 // is it a delimiter?
l_item$ = l_item$ + l_char$ // no, so add it to the next item
ELSE
p_split$[l_idx] = l_item$ // yes then continue with the next word
l_item$ = ""
INC l_idx, 1
ENDIF
NEXT
p_split$[l_idx] = l_item$
RETURN LEN(p_split$[]) // return the number of strings
ENDFUNCTION
First the size of the array is calculated. Then it's filled.
The following coding shows all three versions of SplitStr
- SPLITSTR command of GLBasic
- slow version of TrueSplitStr
- fast version of TrueSplitStr
and shows the needed time. As an example I've attached a csv file (example.txt), which is composed by 1000 rows and 10 columns. The commented function debugshow shows the results in the debug window (activate debug mode first).
The results on my computer were: The slow variant fo TrueSplitStr was 2-3 times slower than the SPLITSTR command, the fast variant about 50% slower.
LOCAL text$[]
LOCAL split$[]
LOCAL line$
// read a sample csv file
OPENFILE(1, "example.csv", TRUE)
WHILE ENDOFFILE(1) = FALSE
READLINE 1, line$
DIMPUSH text$[], line$
WEND
LOCAL time_start, time_end
LOCAL time_splitstr, time_splitstrslow, time_splitstrfast
LOCAL i, total, count
total = LEN(text$[])
// first test with original SPLITSTR command
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = SPLITSTR(text$[i], split$[], ",")
//debugshow(split$[]) // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstr = time_end - time_start
// now the "slow" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = TrueSplitStr_Slow(text$[i], split$[], ",")
//debugshow(split$[]) // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrslow = time_end - time_start
// and finally the "fast" variant of TrueSplitStr
time_start = GETTIMERALL()
FOR i = 0 TO total - 1
count = TrueSplitStr_Fast(text$[i], split$[], ",")
//debugshow(split$[]) // activate debug mode to see the results in debug window
NEXT
time_end = GETTIMERALL()
time_splitstrfast = time_end - time_start
PRINT "Original SPLITSTR command :" + time_splitstr, 0, 0
PRINT "Slow TrueSplitStr function :" + time_splitstrslow, 0, 30
PRINT "Fast TrueSplitStr function :" + time_splitstrfast, 0, 60
SHOWSCREEN
KEYWAIT
FUNCTION TrueSplitStr_Slow: p_text$, p_split$[], p_delim$
LOCAL l_len_text // length of text$
LOCAL l_ctext // loop counter
LOCAL l_char$ // one character of text$
LOCAL l_item$ // to add to p_split$[]
l_len_text = LEN(p_text$)
DIM p_split$[0] // initialize p_spit$[]
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1) // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1 // is it a delimiter?
l_item$ = l_item$ + l_char$ // no, so add it to the next item
ELSE
DIMPUSH p_split$[], l_item$ // yes then continue with the next word
l_item$ = ""
ENDIF
NEXT
DIMPUSH p_split$[], l_item$
RETURN LEN(p_split$[]) // return the number of strings
ENDFUNCTION
FUNCTION TrueSplitStr_Fast: p_text$, p_split$[], p_delim$
LOCAL l_len_text // length of text$
LOCAL l_ctext // loop counter
LOCAL l_char$ // one character of text$
LOCAL l_item$ // to add to p_split$[]
LOCAL l_count // number of entries needed for p_split$[]
LOCAL l_idx // current index of p_split$[]
l_len_text = LEN(p_text$)
// first get the needed number of entries for p_split$[]
FOR l_ctext = 0 TO l_len_text - 1
IF INSTR(p_delim$, MID$(p_text$, l_ctext, 1)) > -1
INC l_count, 1
ENDIF
NEXT
REDIM p_split$[l_count + 1]
// now do the split
FOR l_ctext = 0 TO l_len_text - 1
l_char$ = MID$(p_text$, l_ctext, 1) // get the next character of text$
IF INSTR(p_delim$, l_char$) = -1 // is it a delimiter?
l_item$ = l_item$ + l_char$ // no, so add it to the next item
ELSE
p_split$[l_idx] = l_item$ // yes then continue with the next word
l_item$ = ""
INC l_idx, 1
ENDIF
NEXT
p_split$[l_idx] = l_item$
RETURN LEN(p_split$[]) // return the number of strings
ENDFUNCTION
FUNCTION debugshow: p_split$[]
LOCAL i
FOR i = 0 TO LEN(p_split$[]) - 1
DEBUG p_split$[i] + "\t"
NEXT
DEBUG "\n"
ENDFUNCTION
[attachment deleted by admin]
bottleneck is the MID$. If you want it fast, do inline with const char*'s.
Nice finally someone noticed the speed of SLPITSTR - was a lot of work ;)
yes I tried to do it with INLINE ... and it worked after several hours of testing ;), but speed was not much better than the pure GLBasic version, so I discard this solution.
The results on my pc for 1000 lines csv file:
- SPLITSTR: 33 ms
- TrueSplitStr (slow): 66 ms
- TrueSplitStr (fast): 46 ms
not bad I think ;)
Normally a function like SPLITSTR is not used in a loop but in front of it for initialization works, so the speed should be sufficient anyway.