PluckString command to extract string from another

Posted: Thu May 19, 2011 1:31 pm
by MachineCode
Would love a command that would return a string from another string, where the user specifies a first "border" and last "border". So, the command p$=PluckString("abc123def","c","d") would return "123" for p$, because "123" is between "c" and "d". Would make parsing HTML elements so easy!

Currently we have to do it with FOUR commands: three FindString() and one Mid(), and even the following example is buggy because "d" might be found before "c"!

Code: Select all

s$="abc123def"
p$=Mid(s$,FindString(s$,"c",1)+1,FindString(s$,"d",1)-FindString(s$,"c",1)-1)
Debug p$ ; Returns 123
EDIT! Just to clarify: the border strings should be strings, not just single characters! 8)
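
The "d might be found before c" problem can be shown in a few lines. This is only a minimal sketch, assuming a test string that contains an extra "d" in front of the "c" (the string is made up for illustration): starting the second FindString() after the left border avoids the negative length.

```purebasic
s$ = "dabc123def"                 ; made-up input: a "d" occurs before the "c"
l = FindString(s$, "c", 1)        ; position of the left border
r = FindString(s$, "d", 1)        ; first "d" in the string, which lies BEFORE "c"
Debug r - l - 1                   ; length that would be passed to Mid(): negative

r = FindString(s$, "d", l + 1)    ; search for the right border only after the left one
Debug Mid(s$, l + 1, r - l - 1)   ; 123
```

The same idea carries over to multi-character borders by adding Len() of the left border instead of 1.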

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 2:18 pm
by TomS
MachineCode wrote:the following example is buggy because "d" might be found before "c"!
Thanks for mentioning this. I forgot about that. That would have caused a neat little IMA in my Code^^

EDIT1: Updated to handle a situation where the delimiter should be ignored (inside double quotes). Thanks to Trond: http://purebasic.fr/english/viewtopic.p ... 07#p353107

Code: Select all

Procedure Extract(input.s, del1.s, del2.s, Array output.s(1), ignoreStrings.i=0)
	
	Protected *c.Character       = @input         ; Character-sized pointer at the address of the input string ( = first char in string)
	
	Protected del1_asc.i          = Asc(del1)         ; Get Asciicode for a faster comparison
	Protected del2_asc.i          = Asc(del2)         ; ^^ dito ^^
	
	Protected sameDel.i         = #False         ; A flag that is set true, if the delimiters are the same
	
	Protected state.i                           ; current state: Inside or Outside of the StringToExtract 
	Protected state_inStringDQ.i				; inside/outside a DQ-String "string"
	Protected *area_begin, *area_end               ; MemoryPointers to the Begin and End of the StringToExtract
	
	Protected arrayIndex.i = 0
	
	Protected PeekSFormat.i                     ; StringFormat (#PB_Unicode, #PB_Ascii)
	
	Select SizeOf(Character)
		Case 1
			PeekSFormat = #PB_Ascii
		Case 2
			PeekSFormat = #PB_Unicode               ; 2-byte characters are the Unicode format, not UTF-8
	EndSelect
	
	
	If del1 = del2
		sameDel = #True
	EndIf
	
	While *c\c <> 0      ; Loop until the terminating null character
		
		
		Select *c\c
			Case '"'
				If ignoreStrings = #True
					state_inStringDQ = Abs(state_inStringDQ - 1)    
				EndIf
				
			Case del1_asc            
				
				If state_inStringDQ <> #True
					If sameDel = #True            
						state = Abs(state - 1)            ; Alternate inside/outside state
					Else
						state=1
					EndIf
					
					Select state
						Case 1
							*area_begin = *c
						Case 0
							
							*area_end = *c               
							If *area_begin>0                     
								output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin) / SizeOf(Character) - 1, PeekSFormat)   ; PeekS length is in characters, not bytes
								arrayIndex+1                  
								ReDim output(ArraySize(output())+1)
							EndIf    
							
					EndSelect
				EndIf 
				
			Case del2_asc
				If state_inStringDQ <> #True
					*area_end = *c
					If *area_begin>0               
						output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin) / SizeOf(Character) - 1, PeekSFormat)   ; PeekS length is in characters, not bytes
						arrayIndex+1                  
						ReDim output(ArraySize(output())+1)
					EndIf
				EndIf 
		EndSelect 
		*c + SizeOf(Character)                     ; Next Character
	Wend
	
	
EndProcedure

;- Example

Dim myOutput.s(1)    ;Create an Array that contains all findings

text.s="bla <result = "+Chr(34)+"x>3"+Chr(34)+"> blub"    ;our text.
d.s="<"      ;our first  "border" (entering StringToExtract)
e.s=">"      ;our second "border" (leaving  StringToExtract)


Extract(text,d,e, myOutput(), #True)


For a=0 To ArraySize(myOutput())-2      ;Let's see what we have here.
	Debug myOutput(a)
Next


;- Another Example

ReDim myOutput.s(1)

text.s="This will ?extract? strings that are framed by a ?single? character"    ;our text.
d.s="?"      ;our first  "border" (entering StringToExtract)
e.s="?"      ;our second "border" (leaving  StringToExtract)


Extract(text,d,e, myOutput())


For a=0 To ArraySize(myOutput())-2      ;Let's see what we have here.
	Debug myOutput(a)
Next 

End 

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 2:26 pm
by MachineCode
Your long code sample is precisely why we need a nice little native command to do it. :mrgreen:

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 2:36 pm
by TomS
Sure.
But until then, I think it's better than your FindString approach.
And hopefully faster (or else learning this memory stuff was all for nothing :lol: )

I think Fred should concentrate on things that can't be achieved with PB only (e.g. things that are possible using the Windows API but don't work on Linux, like the custom cursor handles for the CanvasGadget^^) so that we all can continue writing code that is easily portable to other operating systems.

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 2:43 pm
by Shardik
MachineCode wrote:Would love a command that would return a string from another string, where the user specifies a first "border" and last "border".
When using StringField() you only need 2 instructions:

Code: Select all

p$ = StringField(StringField(s$, 2, "c"), 1, "d")
And with StringField() you may even use more than one character for the "borders"... :wink:

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 3:13 pm
by MachineCode

Code: Select all

s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,"<a href='hi'>"),1,"</a>")
Debug p$ ; Fails to return "test".

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 3:22 pm
by TomS
Yes. Because the help says that the length of the delimiter can only be 1.
My code could be adapted so it works with longer delimiters, too.

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 3:28 pm
by Shardik
TomS wrote:Yes. Because the help says that the length of the delimiter can only be 1.
Sorry, I am getting old. TomS is right... :oops:

Nevertheless to obtain "test" you could try it even with one "border" character:

Code: Select all

s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,">"),1,"<")
Debug p$ ; Doesn't fail to return "test".

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 4:17 pm
by Trond
Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code: Select all

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 4:31 pm
by TomS
I updated my code to handle this situation (delimiters within doublequotes* are ignored).
*easily expandable to single quotes and more...

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 5:52 pm
by skywalk
Good Topic...
I modified my code and noticed the FindString behavior changed with 4.6.

Code: Select all

Debug FindString("text","x")     ; Normal behavior
Debug FindString("text","x", 0)  ; Used to cause an error
Debug FindString("text","x",-1)  ; Used to cause an error
Here is what I use for "between text" searches. I'm sure it's not the fastest. :wink:

Code: Select all

CompilerIf #PB_Compiler_Unicode  
  Macro MidF(inString, StartPos, Length=-1)
    PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Unicode)
  EndMacro 
CompilerElse
  Macro MidF(inString, StartPos, Length=-1)
    PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Ascii)
  EndMacro  
CompilerEndIf

Procedure.s SF_Between(sSearchIn.s, sFrom.s, sTo.s, *PosAfter.integer=0, StartPos.i=1)
  ; REV:  110519, skywalk
  ;       PB 4.6 FindString() added default StartPos and accepts <=0 entries without error.
  ; Returns a string between 2 multi-char delimiters
  ; Parameters:  
  ;    sSearchIn:   String to search
  ;    sFrom:       1st keyword
  ;    sTo:         2nd keyword
  ;    PosAfter:    Text position after found String
  ;    StartPos:    Position to start search
  ; Syntax:
  ;    Debug SF_Between("<html>some text here</html>", "<html>", "</html>")
  ;    Debug SF_Between(r, "some", " ", @PosAfter, StartPos)
  Protected.i nLen1,nLen2,nLen,nLen3
  Protected.s sFound
  nLen1 = FindString(sSearchIn, sFrom, StartPos)
  If nLen1
    nLen2 = FindString(sSearchIn, sTo, nLen1 + Len(sFrom))
    If nLen2
      nLen = nLen1 + Len(sFrom)
      nLen3 = nLen2 - nLen
      sFound = MidF(sSearchIn, nLen, nLen3)
      If (nLen + nLen3 > 0) And *PosAfter  ; Avoid Null Pointer error
        *PosAfter\i = nLen2                ; used to be -> nLen
      EndIf
    EndIf
  EndIf
  ProcedureReturn sFound
EndProcedure

Define.s r = "<html>some text here</html> <html> and again </html>"
Define.i PosAfter
Debug "1-> " + SF_Between(r, "<html>", "</html>")
Debug "2-> " + SF_Between(r, "<html>", "")
Debug "3-> " + SF_Between(r, " ", " ", @PosAfter, 20)
Debug "4-> " + SF_Between(r, "some", " ", @PosAfter)
Debug "5-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$
Debug "6-> " + SF_Between(r, "some", " ", @PosAfter, PosAfter)
Debug "7-> " + SF_Between(r, "some", "<", @PosAfter, -1)
Debug "8-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 6:16 pm
by Little John
The following simple code does the job:

Code: Select all

Procedure.s PluckString (main$, before$, after$, *posAfterAfter.Integer, start.i=1)
   Protected.i left, right
   
   If start > 0
      left = FindString(main$, before$, start)
      If left > 0
         left + Len(before$)
         right = FindString(main$, after$, left)
         If right > 0
            *posAfterAfter\i = right + Len(after$)
            ProcedureReturn Mid(main$, left, right-left)
         EndIf
      EndIf
   EndIf
   
   *posAfterAfter\i = 0
   ProcedureReturn ""
EndProcedure


; -- Demo
Define nextStart.i

Debug "*" + PluckString("<a href='xyz'>name</a>", "<a href='xyz'>", "</a>", @nextStart) + "*"
Debug nextStart
Debug "*" + PluckString("<a href='xyz'>name</a>", ">", "<", @nextStart) + "*"
Debug nextStart
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code: Select all

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be

Code: Select all

<element attribute="fhasdf&gt;lsdgf">
?

Regards, Little John
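
The position returned through the pointer parameter is what makes repeated searches possible. As a rough sketch of that idea (PluckAll and its list parameter are hypothetical names, not from any post in this thread), the following collects every substring between the two borders by restarting the search after each right border:

```purebasic
; Hypothetical helper: collects all substrings between before$ and after$.
; Returns the number of substrings found.
Procedure.i PluckAll(main$, before$, after$, List found$())
  Protected left.i, right.i, start.i = 1
  
  ClearList(found$())
  If before$ = "" Or after$ = ""
    ProcedureReturn 0                        ; empty borders cannot be searched for
  EndIf
  
  Repeat
    left = FindString(main$, before$, start)
    If left = 0 : Break : EndIf
    left + Len(before$)                      ; first character after the left border
    right = FindString(main$, after$, left)  ; right border must come after the left one
    If right = 0 : Break : EndIf
    AddElement(found$())                     ; new element starts out as an empty string
    If right > left
      found$() = Mid(main$, left, right - left)
    EndIf
    start = right + Len(after$)              ; continue behind the right border
  ForEver
  
  ProcedureReturn ListSize(found$())
EndProcedure

; -- Demo
NewList hits$()
PluckAll("<b>one</b> and <b>two</b>", "<b>", "</b>", hits$())
ForEach hits$()
  Debug hits$()                              ; one, then two
Next
```

A list is used here instead of an array so that no ReDim bookkeeping is needed; that is a design preference, nothing more.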

Re: PluckString command to extract string from another

Posted: Thu May 19, 2011 9:40 pm
by Trond
Little John wrote:
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code: Select all

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be

Code: Select all

<element attribute="fhasdf&gt;lsdgf">
?

Regards, Little John
I'm quite sure it's valid. &gt; is not needed because it's inside the double quotes.

Re: PluckString command to extract string from another

Posted: Fri May 20, 2011 2:05 pm
by MachineCode
Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.

Anyway, here's what I've been using, which works great. Just wish a single command could do it.

Code: Select all

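Procedure.s PluckString(text$,lborder$,rborder$)
  Protected l, r, s, p$
  l=FindString(text$,lborder$,1)
  If l
    s=l+Len(lborder$)              ; first character after the left border
    r=FindString(text$,rborder$,s) ; look for the right border only behind the whole left border
    If r
      p$=Mid(text$,s,r-s)
    EndIf
  EndIf
  ProcedureReturn p$
EndProcedure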

n$=Str(Random(9999))

Debug "Plucking: "+n$
Debug PluckString("c"+n$+"d","c","d")
Debug PluckString("abc"+n$+"def","c","d")
Debug PluckString("dc"+n$+"dc","c","d")
Debug PluckString("<a href='hi'>"+n$+"</a>","<a href='hi'>","</a>")
Debug PluckString("<attribute='"+n$+"'>","<attribute='","'>")

Re: PluckString command to extract string from another

Posted: Fri May 20, 2011 2:42 pm
by Trond
MachineCode wrote:
Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.
Now that's just making it worse. Consider:

Space between quote and >:

Code: Select all

<element name="sdfkjha>sdf" >
HTML5 has many new attributes without parameters:

Code: Select all

<audio id="asdfrq>we" src="a.avi" loop>
HTML allows single quotes around parameters:

Code: Select all

<element name="asdfwe>ferf" id='squote'>
HTML allows omitting quotes around the parameter, as long as it contains only letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46):

Code: Select all

<element name="owe>fi" src=hi.html>
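
To illustrate why a fixed right "border" is not enough for these cases, here is a minimal sketch of a quote-aware scan. FindTagEnd is a hypothetical helper (not from any post above); it assumes the input starts inside a single tag and it ignores comments, scripts and other HTML constructs, so it is not a real HTML parser.

```purebasic
; Hypothetical helper: returns the position of the ">" that closes the tag,
; skipping any ">" inside single- or double-quoted attribute values.
; Returns 0 if no closing ">" is found.
Procedure.i FindTagEnd(html$, start.i = 1)
  Protected i.i, ch$, quote$ = ""
  
  For i = start To Len(html$)
    ch$ = Mid(html$, i, 1)
    If quote$ <> ""
      If ch$ = quote$ : quote$ = "" : EndIf   ; matching quote closes the value
    ElseIf ch$ = Chr(34) Or ch$ = "'"
      quote$ = ch$                            ; remember which quote opened the value
    ElseIf ch$ = ">"
      ProcedureReturn i                       ; unquoted ">" ends the tag
    EndIf
  Next
  
  ProcedureReturn 0
EndProcedure

; -- Demo with the examples from this post
Debug FindTagEnd("<element name=" + Chr(34) + "sdfkjha>sdf" + Chr(34) + " >")
Debug FindTagEnd("<audio id=" + Chr(34) + "asdfrq>we" + Chr(34) + " src=" + Chr(34) + "a.avi" + Chr(34) + " loop>")
Debug FindTagEnd("<element name=" + Chr(34) + "asdfwe>ferf" + Chr(34) + " id='squote'>")
Debug FindTagEnd("<element name=" + Chr(34) + "owe>fi" + Chr(34) + " src=hi.html>")
```

In each demo line the returned position is that of the last character, because every ">" before it sits inside a quoted value.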