PluckString command to extract string from another
Posted: Thu May 19, 2011 1:31 pm
by MachineCode
Would love a command that would return a string from another string, where the user specifies a first "border" and last "border". So, the command p$=PluckString("abc123def","c","d") would return "123" for p$, because "123" is between "c" and "d". Would make parsing HTML elements so easy!
Currently we have to do it with FOUR commands: three FindString() and one Mid(), and even the following example is buggy because "d" might be found before "c"!
Code:
s$="abc123def"
p$=Mid(s$,FindString(s$,"c",1)+1,FindString(s$,"d",1)-FindString(s$,"c",1)-1)
Debug p$ ; Returns 123
EDIT! Just to clarify: the border strings should be strings, not just single characters!
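The "d found before c" problem mentioned above can be avoided by searching for the right border only after the left border has been located. A minimal sketch, assuming the sample string below; the variable names posL and posR are made up for illustration:

```purebasic
s$ = "dabc123def"                        ; a "d" now appears before the "c"
posL = FindString(s$, "c", 1)            ; left border is at position 4
Debug FindString(s$, "d", 1)             ; 1: a search from the start hits the leading "d"
If posL
  posR = FindString(s$, "d", posL + Len("c")) ; search only after the left border
  If posR
    Debug Mid(s$, posL + Len("c"), posR - posL - Len("c")) ; 123
  EndIf
EndIf
```

Using Len() on the border instead of a fixed +1 keeps the same arithmetic valid for borders longer than one character.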

Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 2:18 pm
by TomS
MachineCode wrote:the following example is buggy because "d" might be found before "c"!
Thanks for mentioning this. I forgot about that. That would have caused a neat little IMA in my Code^^
EDIT1: Updated to handle the situation where a delimiter should be ignored (inside double quotes). Thanks to Trond:
http://purebasic.fr/english/viewtopic.p ... 07#p353107
Code:
Procedure Extract(input.s, del1.s, del2.s, Array output.s(1), ignoreStrings.i=0)
Protected *c.Character = @input ; Pointer with the size of a char at the address of the input string ( = first char in string)
Protected del1_asc.i = Asc(del1) ; Get the character code for a faster comparison
Protected del2_asc.i = Asc(del2) ; ^^ ditto ^^
Protected sameDel.i = #False ; A flag that is set true, if the delimiters are the same
Protected state.i ; current state: Inside or Outside of the StringToExtract
Protected state_inStringDQ.i ; inside/outside a DQ-String "string"
Protected *area_begin, *area_end ; MemoryPointers to the Begin and End of the StringToExtract
Protected arrayIndex.i = 0
Protected PeekSFormat.i ; StringFormat (#PB_Unicode, #PB_Ascii)
Select SizeOf(Character)
Case 1
PeekSFormat = #PB_Ascii
Case 2
PeekSFormat = #PB_Unicode ; 2-byte characters are UCS-2, not UTF-8
EndSelect
If del1 = del2
sameDel = #True
EndIf
While *c\c <> 0 ; Loop until the terminating null character
Select *c\c
Case '"'
If ignoreStrings = #True
state_inStringDQ = Abs(state_inStringDQ - 1)
EndIf
Case del1_asc
If state_inStringDQ <> #True
If sameDel = #True
state = Abs(state - 1) ; Alternate inside/outside state
Else
state=1
EndIf
Select state
Case 1
*area_begin = *c
Case 0
*area_end = *c
If *area_begin>0
output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin)/SizeOf(Character) - 1, PeekSFormat) ; PeekS length is in characters, not bytes
arrayIndex+1
ReDim output(ArraySize(output())+1)
EndIf
EndSelect
EndIf
Case del2_asc
If state_inStringDQ <> #True
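*area_end = *c
If *area_begin>0
output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin)/SizeOf(Character) - 1, PeekSFormat) ; PeekS length is in characters, not bytes
arrayIndex+1
ReDim output(ArraySize(output())+1)
*area_begin = 0 ; Forget the old start so a stray closing delimiter doesn't extract again
EndIf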
EndIf
EndSelect
*c + SizeOf(Character) ; Next Character
Wend
EndProcedure
;- Example
Dim myOutput.s(1) ;Create an Array that contains all findings
text.s="bla <result = "+Chr(34)+"x>3"+Chr(34)+"> blub" ;our text.
d.s="<" ;our first "border" (entering StringToExtract)
e.s=">" ;our second "border" (leaving StringToExtract)
Extract(text,d,e, myOutput(), #True)
For a=0 To ArraySize(myOutput())-2 ;Let's see what we have here.
Debug myOutput(a)
Next
;- Another Example
ReDim myOutput.s(1)
text.s="This will ?extract? strings that are framed by a ?single? character" ;our text.
d.s="?" ;our first "border" (entering StringToExtract)
e.s="?" ;our second "border" (leaving StringToExtract)
Extract(text,d,e, myOutput())
For a=0 To ArraySize(myOutput())-2 ;Let's see what we have here.
Debug myOutput(a)
Next
End
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 2:26 pm
by MachineCode
Your long code sample is precisely why we need a nice little native command to do it.

Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 2:36 pm
by TomS
Sure.
But until then, I think it's better than your FindString approach.
And hopefully faster (or else learning this memory stuff was all for nothing).
I think Fred should concentrate on things that can't be achieved with PB only (e.g. things that are possible using the Windows API but don't work on Linux, like the custom cursor handles for the CanvasGadget^^) so that we all can continue writing code that is easily portable to other operating systems.
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 2:43 pm
by Shardik
MachineCode wrote:Would love a command that would return a string from another string, where the user specifies a first "border" and last "border".
When using StringField() you only need 2 instructions:
Code:
p$ = StringField(StringField(s$, 2, "c"), 1, "d")
And with StringField() you may even use more than one character for the "borders"...

Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 3:13 pm
by MachineCode
Code:
s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,"<a href='hi'>"),1,"</a>")
Debug p$ ; Fails to return "test".
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 3:22 pm
by TomS
Yes. Because the help says that the length of the delimiter can only be 1.
My code could be adapted so it works with longer delimiters, too.
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 3:28 pm
by Shardik
TomS wrote:Yes. Because the help says that the length of the delimiter can only be 1.
Sorry, I am getting old. TomS is right...
Nevertheless, to obtain "test" you could try it even with one "border" character:
Code:
s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,">"),1,"<")
Debug p$ ; Doesn't fail to return "test".
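Why the nested StringField() call works becomes clearer when the intermediate field is printed. A small sketch, assuming the same sample string; the variable inner$ is introduced only for illustration:

```purebasic
s$ = "blah <a href='hi'>test</a> blah"
inner$ = StringField(s$, 2, ">")      ; second field when splitting on ">"
Debug inner$                          ; test</a
Debug StringField(inner$, 1, "<")     ; test
```

This only gives the wanted text because the first ">" in the string happens to close the opening tag. An earlier ">" anywhere in the input would shift the field index.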
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 4:17 pm
by Trond
Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:
Code:
<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 4:31 pm
by TomS
I updated my code to handle this situation (delimiters within doublequotes* are ignored).
*easily expandable to single quotes and more...
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 5:52 pm
by skywalk
Good Topic...
I modified my code and noticed the FindString behavior changed with 4.6.
Code:
Debug FindString("text","x") ; Normal behavior
Debug FindString("text","x", 0) ; Used to cause an error
Debug FindString("text","x",-1) ; Used to cause an error
Here is what I use for "between text" searches. I'm sure it's not the fastest.
Code:
CompilerIf #PB_Compiler_Unicode
Macro MidF(inString, StartPos, Length=-1)
PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Unicode)
EndMacro
CompilerElse
Macro MidF(inString, StartPos, Length=-1)
PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Ascii)
EndMacro
CompilerEndIf
Procedure.s SF_Between(sSearchIn.s, sFrom.s, sTo.s, *PosAfter.integer=0, StartPos.i=1)
; REV: 110519, skywalk
; PB 4.6 FindString() added default StartPos and accepts <=0 entries without error.
; Returns a string between 2 multi-char delimiters
; Parameters:
; sSearchIn: String to search
; sFrom: 1st keyword
; sTo: 2nd keyword
; PosAfter: Text position after found String
; StartPos: Position to start search
; Syntax:
; Debug SF_Between("<html>some text here</html>", "<html>", "</html>")
; Debug SF_Between(r, "some", " ", @PosAfter, StartPos)
Protected.i nLen1,nLen2,nLen,nLen3
Protected.s sFound
nLen1 = FindString(sSearchIn, sFrom, StartPos)
If nLen1
nLen2 = FindString(sSearchIn, sTo, nLen1 + Len(sFrom))
If nLen2
nLen = nLen1 + Len(sFrom)
nLen3 = nLen2 - nLen
sFound = MidF(sSearchIn, nLen, nLen3)
If (nLen + nLen3 > 0) And *PosAfter ; Avoid Null Pointer error
*PosAfter\i = nLen2 ; used to be -> nLen
EndIf
EndIf
EndIf
ProcedureReturn sFound
EndProcedure
Define.s r = "<html>some text here</html> <html> and again </html>"
Define.i PosAfter
Debug "1-> " + SF_Between(r, "<html>", "</html>")
Debug "2-> " + SF_Between(r, "<html>", "")
Debug "3-> " + SF_Between(r, " ", " ", @PosAfter, 20)
Debug "4-> " + SF_Between(r, "some", " ", @PosAfter)
Debug "5-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$
Debug "6-> " + SF_Between(r, "some", " ", @PosAfter, PosAfter)
Debug "7-> " + SF_Between(r, "some", "<", @PosAfter, -1)
Debug "8-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 6:16 pm
by Little John
The following simple code does the job:
Code:
Procedure.s PluckString (main$, before$, after$, *posAfterAfter.Integer, start.i=1)
Protected.i left, right
If start > 0
left = FindString(main$, before$, start)
If left > 0
left + Len(before$)
right = FindString(main$, after$, left)
If right > 0
*posAfterAfter\i = right + Len(after$)
ProcedureReturn Mid(main$, left, right-left)
EndIf
EndIf
EndIf
*posAfterAfter\i = 0
ProcedureReturn ""
EndProcedure
; -- Demo
Define nextStart.i
Debug "*" + PluckString("<a href='xyz'>name</a>", "<a href='xyz'>", "</a>", @nextStart) + "*"
Debug nextStart
Debug "*" + PluckString("<a href='xyz'>name</a>", ">", "<", @nextStart) + "*"
Debug nextStart
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:
Code:
<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be
Code:
<element attribute="fhasdf&gt;lsdgf">
?
Regards, Little John
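The position handed back through the pointer parameter makes it possible to pluck every occurrence in a loop. The following is only a sketch of that idea: PluckNext is a hypothetical, compact restatement of the procedure above (it reads the start position from the same variable it writes the next position to), and the sample text is made up.

```purebasic
; Hypothetical variant: *nextStart is both the start position and the result position.
Procedure.s PluckNext(main$, before$, after$, *nextStart.Integer)
  Protected first.i, last.i
  If *nextStart\i > 0 And *nextStart\i <= Len(main$)
    first = FindString(main$, before$, *nextStart\i)
    If first > 0
      first + Len(before$)                    ; first character after the left border
      last = FindString(main$, after$, first) ; right border, searched after the left one
      If last > 0
        *nextStart\i = last + Len(after$)     ; continue behind the right border next time
        ProcedureReturn Mid(main$, first, last - first)
      EndIf
    EndIf
  EndIf
  *nextStart\i = 0                            ; signals "nothing more found"
  ProcedureReturn ""
EndProcedure

Define pos.i = 1
Define text$ = "<li>one</li><li>two</li><li>three</li>"
Define item$
Repeat
  item$ = PluckNext(text$, "<li>", "</li>", @pos)
  If pos > 0
    Debug item$                               ; one, two, three (one per line)
  EndIf
Until pos = 0
```

Testing pos rather than the returned string keeps a legitimately empty match from ending the loop early.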
Re: PluckString command to extract string from another
Posted: Thu May 19, 2011 9:40 pm
by Trond
Little John wrote:
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:
Code:
<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be
Code:
<element attribute="fhasdf&gt;lsdgf">
?
Regards, Little John
I'm quite sure it's valid. &gt; is not needed because it's inside the double quotes.
Re: PluckString command to extract string from another
Posted: Fri May 20, 2011 2:05 pm
by MachineCode
Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.
Anyway, here's what I've been using, which works great. Just wish a single command could do it.
Code:
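Procedure.s PluckString(text$,lborder$,rborder$)
Protected l, r, s, p$
l=FindString(text$,lborder$,1)
If l
s=l+Len(lborder$) ; First character after the left border
r=FindString(text$,rborder$,s) ; Look for the right border only after the whole left border
If r
p$=Mid(text$,s,r-s)
EndIf
EndIf
ProcedureReturn p$
EndProcedure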
n$=Str(Random(9999))
Debug "Plucking: "+n$
Debug PluckString("c"+n$+"d","c","d")
Debug PluckString("abc"+n$+"def","c","d")
Debug PluckString("dc"+n$+"dc","c","d")
Debug PluckString("<a href='hi'>"+n$+"</a>","<a href='hi'>","</a>")
Debug PluckString("<attribute='"+n$+"'>","<attribute='","'>")
Re: PluckString command to extract string from another
Posted: Fri May 20, 2011 2:42 pm
by Trond
MachineCode wrote:Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.
Now that's just making it worse. Consider:
Space between quote and >:
HTML5 has many new attributes without parameters:
Code:
<audio id="asdfrq>we" src="a.avi" loop>
HTML allows single quotes around parameters:
Code:
<element name="asdfwe>ferf" id='squote'>
HTML allows omitting quotes around the parameter, as long as it contains only letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46):
Code:
<element name="owe>fi" src=hi.html>
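The cases above suggest that the end of a tag has to be found by scanning and tracking quote state rather than by matching a fixed right border. A rough sketch of such a scan follows; FindTagEnd is a hypothetical helper, not something from this thread or from PureBasic itself, and it is far from a full HTML parser (comments and script content are ignored, for example).

```purebasic
; Hypothetical helper: returns the position of the ">" that closes the tag
; beginning at tagStart, or 0 if none is found.
Procedure.i FindTagEnd(html$, tagStart.i)
  Protected i.i, ch$, quote$
  For i = tagStart To Len(html$)
    ch$ = Mid(html$, i, 1)
    If quote$ <> ""
      If ch$ = quote$
        quote$ = ""                  ; the matching quote closes the value
      EndIf
    ElseIf ch$ = #DQUOTE$ Or ch$ = "'"
      quote$ = ch$                   ; remember which quote opened the value
    ElseIf ch$ = ">"
      ProcedureReturn i              ; first ">" outside any quoted value
    EndIf
  Next
  ProcedureReturn 0
EndProcedure

Define t$ = "<element name=" + #DQUOTE$ + "asdfwe>ferf" + #DQUOTE$ + " id='squote'> text"
Debug Mid(t$, 1, FindTagEnd(t$, 1))  ; prints the tag without the trailing " text"
```

Because the scan does not depend on what precedes the ">", it also copes with a space before the ">", with valueless attributes such as loop, and with unquoted values such as src=hi.html.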