PluckString command to extract string from another

MachineCode
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

PluckString command to extract string from another

Post by MachineCode »

Would love a command that returns a string from inside another string, where the user specifies a first "border" and a last "border". So, the command p$=PluckString("abc123def","c","d") would return "123" for p$, because "123" is between "c" and "d". Would make parsing HTML elements so easy!

Currently we have to do it with FOUR commands: three FindString() and one Mid(), and even the following example is buggy because "d" might be found before "c"!

Code:

s$="abc123def"
p$=Mid(s$,FindString(s$,"c",1)+1,FindString(s$,"d",1)-FindString(s$,"c",1)-1)
Debug p$ ; Returns 123
EDIT: Just to clarify, the borders should be strings of any length, not just single characters! 8)
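
A minimal sketch of one way around the ordering problem (an illustration, not part of the original post): start the search for the right border just past the left border, so an earlier "d" is ignored. The test string and the variable names are made up for the example.

```purebasic
s$ = "dabc123def"                  ; note the stray "d" before "c"
l = FindString(s$, "c", 1)         ; position of the left border
r = FindString(s$, "d", l + 1)     ; look for the right border only after the left one
If l And r
  Debug Mid(s$, l + 1, r - l - 1)  ; 123
EndIf
```

This only covers single-character borders; for longer borders the second search would have to start at l + Len(leftBorder$) instead of l + 1.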
Last edited by MachineCode on Thu May 19, 2011 2:28 pm, edited 1 time in total.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
TomS
Enthusiast
Posts: 342
Joined: Sun Mar 18, 2007 2:26 pm
Location: Munich, Germany

Re: PluckString command to extract string from another

Post by TomS »

MachineCode wrote:the following example is buggy because "d" might be found before "c"!
Thanks for mentioning this. I forgot about that. It would have caused a neat little IMA (invalid memory access) in my code ^^

EDIT1: Updated to handle the situation where a delimiter should be ignored (inside double quotes). Thanks to Trond: http://purebasic.fr/english/viewtopic.p ... 07#p353107

Code:

Procedure Extract(input.s, del1.s, del2.s, Array output.s(1), ignoreStrings.i=0)
	
	Protected *c.Character       = @input         ; Character-sized pointer at the address of the input string ( = first char in string)
	
	Protected del1_asc.i          = Asc(del1)         ; Character code of the delimiter's first character, for a faster comparison
	Protected del2_asc.i          = Asc(del2)         ; ^^ ditto ^^
	
	Protected sameDel.i         = #False         ; A flag that is set true, if the delimiters are the same
	
	Protected state.i                           ; current state: Inside or Outside of the StringToExtract 
	Protected state_inStringDQ.i				; inside/outside a DQ-String "string"
	Protected *area_begin, *area_end               ; MemoryPointers to the Begin and End of the StringToExtract
	
	Protected arrayIndex.i = 0
	
	Protected PeekSFormat.i                     ; StringFormat (#PB_Unicode, #PB_Ascii)
	
	Select SizeOf(Character)
		Case 1
			PeekSFormat = #PB_Ascii
		Case 2
			PeekSFormat = #PB_Unicode             ; 2-byte characters are UCS-2, not UTF-8
	EndSelect
	
	
	If del1 = del2
		sameDel = #True
	EndIf
	
	While *c\c <> 0      ; Loop until the terminating null character
		
		
		Select *c\c
			Case '"'
				If ignoreStrings = #True
					state_inStringDQ = Abs(state_inStringDQ - 1)    
				EndIf
				
			Case del1_asc            
				
				If state_inStringDQ <> #True
					If sameDel = #True            
						state = Abs(state - 1)            ; Alternate inside/outside state
					Else
						state=1
					EndIf
					
					Select state
						Case 1
							*area_begin = *c
						Case 0
							
							*area_end = *c               
							If *area_begin>0                     
								output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin)/SizeOf(Character) - 1, PeekSFormat)   ; Length in characters, not bytes
								arrayIndex+1                  
								ReDim output(ArraySize(output())+1)
							EndIf    
							
					EndSelect
				EndIf 
				
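			Case del2_asc
				If state_inStringDQ <> #True
					*area_end = *c
					If *area_begin>0               
						; PeekS expects the length in characters, not bytes
						output(arrayIndex) = PeekS(*area_begin+SizeOf(Character), (*area_end - *area_begin)/SizeOf(Character) - 1, PeekSFormat)
						arrayIndex+1                  
						ReDim output(ArraySize(output())+1)
						*area_begin = 0               ; Wait for the next opening delimiter
					EndIf
				EndIf 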
		EndSelect 
		*c + SizeOf(Character)                     ; Next Character
	Wend
	
	
EndProcedure

;- Example

Dim myOutput.s(1)    ;Create an Array that contains all findings

text.s="bla <result = "+Chr(34)+"x>3"+Chr(34)+"> blub"    ;our text.
d.s="<"      ;our first  "border" (entering StringToExtract)
e.s=">"      ;our second "border" (leaving  StringToExtract)


Extract(text,d,e, myOutput(), #True)


For a=0 To ArraySize(myOutput())-2      ;Let's see what we have here.
	Debug myOutput(a)
Next


;- Another Example

ReDim myOutput.s(1)

text.s="This will ?extract? strings that are framed by a ?single? character"    ;our text.
d.s="?"      ;our first  "border" (entering StringToExtract)
e.s="?"      ;our second "border" (leaving  StringToExtract)


Extract(text,d,e, myOutput())


For a=0 To ArraySize(myOutput())-2      ;Let's see what we have here.
	Debug myOutput(a)
Next 

End 
Last edited by TomS on Thu May 19, 2011 4:28 pm, edited 1 time in total.

Re: PluckString command to extract string from another

Post by MachineCode »

Your long code sample is precisely why we need a nice little native command to do it. :mrgreen:

Re: PluckString command to extract string from another

Post by TomS »

Sure.
But until then, I think it's better than your FindString approach.
And hopefully faster (or else learning this memory stuff was all for nothing :lol: )

I think Fred should concentrate on things that can't be achieved with PB alone (e.g. things that are possible using the Windows API but don't work on Linux, like the custom cursor handles for the CanvasGadget ^^), so that we can all continue writing code that is easily portable to other operating systems.
Shardik
Addict
Posts: 2058
Joined: Thu Apr 21, 2005 2:38 pm
Location: Germany

Re: PluckString command to extract string from another

Post by Shardik »

MachineCode wrote:Would love a command that would return a string from another string, where the user specifies a first "border" and last "border".
When using StringField() you only need 2 instructions:

Code:

p$ = StringField(StringField(s$, 2, "c"), 1, "d")
And with StringField() you may even use more than one character for the "borders"... :wink:

Re: PluckString command to extract string from another

Post by MachineCode »

Code:

s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,"<a href='hi'>"),1,"</a>")
Debug p$ ; Fails to return "test".

Re: PluckString command to extract string from another

Post by TomS »

Yes. Because the help says that the length of the delimiter can only be 1.
My code could be adapted so it works with longer delimiters, too.
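
As a rough sketch of what "longer delimiters" could look like (this is not TomS's adapted code, which the thread doesn't show), the version below swaps the pointer loop for repeated FindString() calls and collects every match in a list. The name ExtractAll and its parameters are hypothetical.

```purebasic
; Hypothetical helper (not from the thread): collects every substring found
; between del1$ and del2$; both delimiters may be longer than one character.
Procedure.i ExtractAll(text$, del1$, del2$, List found.s())
  Protected posL.i, posR.i, start.i = 1

  ; An empty delimiter can never be matched, so there is nothing to collect.
  If del1$ = "" Or del2$ = ""
    ProcedureReturn 0
  EndIf

  Repeat
    posL = FindString(text$, del1$, start)
    If posL = 0
      Break
    EndIf
    posL + Len(del1$)                        ; first character after the opening delimiter
    posR = FindString(text$, del2$, posL)
    If posR = 0
      Break
    EndIf
    AddElement(found())
    found() = Mid(text$, posL, posR - posL)
    start = posR + Len(del2$)                ; continue after the closing delimiter
  ForEver

  ProcedureReturn ListSize(found())          ; number of matches collected
EndProcedure

NewList hits.s()
ExtractAll("<b>one</b> and <b>two</b>", "<b>", "</b>", hits())
ForEach hits()
  Debug hits()
Next
```

Unlike the pointer-based procedure, this sketch does not skip delimiters inside double quotes; it trades that feature (and probably some speed) for brevity.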

Re: PluckString command to extract string from another

Post by Shardik »

TomS wrote:Yes. Because the help says that the length of the delimiter can only be 1.
Sorry, I am getting old. TomS is right... :oops:

Nevertheless, to obtain "test" you could even do it with single-character "borders":

Code:

s$="blah <a href='hi'>test</a> blah"
p$ = StringField(StringField(s$,2,">"),1,"<")
Debug p$ ; Doesn't fail to return "test".
Trond
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: PluckString command to extract string from another

Post by Trond »

Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code:

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.

Re: PluckString command to extract string from another

Post by TomS »

I updated my code to handle this situation (delimiters within doublequotes* are ignored).
*easily expandable to single quotes and more...
skywalk
Addict
Posts: 4210
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: PluckString command to extract string from another

Post by skywalk »

Good Topic...
I modified my code and noticed the FindString behavior changed with 4.6.

Code:

Debug FindString("text","x")     ; Normal behavior
Debug FindString("text","x", 0)  ; Used to cause an error
Debug FindString("text","x",-1)  ; Used to cause an error
Here is what I use for "between text" searches. I'm sure it's not the fastest. :wink:

Code:

CompilerIf #PB_Compiler_Unicode  
  Macro MidF(inString, StartPos, Length=-1)
    PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Unicode)
  EndMacro 
CompilerElse
  Macro MidF(inString, StartPos, Length=-1)
    PeekS(@inString + ((StartPos - 1) * SizeOf(Character)), Length, #PB_Ascii)
  EndMacro  
CompilerEndIf

Procedure.s SF_Between(sSearchIn.s, sFrom.s, sTo.s, *PosAfter.integer=0, StartPos.i=1)
  ; REV:  110519, skywalk
  ;       PB 4.6 FindString() added default StartPos and accepts <=0 entries without error.
  ; Returns a string between 2 multi-char delimiters
  ; Parameters:  
  ;    sSearchIn:   String to search
  ;    sFrom:       1st keyword
  ;    sTo:         2nd keyword
  ;    PosAfter:    Text position after found String
  ;    StartPos:    Position to start search
  ; Syntax:
  ;    Debug SF_Between("<html>some text here</html>", "<html>", "</html>")
  ;    Debug SF_Between(r, "some", " ", @PosAfter, StartPos)
  Protected.i nLen1,nLen2,nLen,nLen3
  Protected.s sFound
  nLen1 = FindString(sSearchIn, sFrom, StartPos)
  If nLen1
    nLen2 = FindString(sSearchIn, sTo, nLen1 + Len(sFrom))
    If nLen2
      nLen = nLen1 + Len(sFrom)
      nLen3 = nLen2 - nLen
      sFound = MidF(sSearchIn, nLen, nLen3)
      If (nLen + nLen3 > 0) And *PosAfter  ; Avoid Null Pointer error
        *PosAfter\i = nLen2                ; used to be -> nLen
      EndIf
    EndIf
  EndIf
  ProcedureReturn sFound
EndProcedure

Define.s r = "<html>some text here</html> <html> and again </html>"
Define.i PosAfter
Debug "1-> " + SF_Between(r, "<html>", "</html>")
Debug "2-> " + SF_Between(r, "<html>", "")
Debug "3-> " + SF_Between(r, " ", " ", @PosAfter, 20)
Debug "4-> " + SF_Between(r, "some", " ", @PosAfter)
Debug "5-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$
Debug "6-> " + SF_Between(r, "some", " ", @PosAfter, PosAfter)
Debug "7-> " + SF_Between(r, "some", "<", @PosAfter, -1)
Debug "8-> " + "PosAfter = " + Str(PosAfter) + " = " + #DQUOTE$ + midf(r,PosAfter) + #DQUOTE$
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
Little John
Addict
Posts: 4775
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: PluckString command to extract string from another

Post by Little John »

The following simple code does the job:

Code:

Procedure.s PluckString (main$, before$, after$, *posAfterAfter.Integer, start.i=1)
   Protected.i left, right
   
   If start > 0
      left = FindString(main$, before$, start)
      If left > 0
         left + Len(before$)
         right = FindString(main$, after$, left)
         If right > 0
            *posAfterAfter\i = right + Len(after$)
            ProcedureReturn Mid(main$, left, right-left)
         EndIf
      EndIf
   EndIf
   
   *posAfterAfter\i = 0
   ProcedureReturn ""
EndProcedure


; -- Demo
Define nextStart.i

Debug "*" + PluckString("<a href='xyz'>name</a>", "<a href='xyz'>", "</a>", @nextStart) + "*"
Debug nextStart
Debug "*" + PluckString("<a href='xyz'>name</a>", ">", "<", @nextStart) + "*"
Debug nextStart
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code:

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be

Code:

<element attribute="fhasdf&gt;lsdgf">
?

Regards, Little John

Re: PluckString command to extract string from another

Post by Trond »

Little John wrote:
Trond wrote:Parsing HTML is a lot more complex than returning whatever is between < and >. Consider this example:

Code:

<element attribute="fhasdf>lsdgf">
The first > is part of the attribute value, so it must not be interpreted as the end of the tag.
Are you sure that that's valid HTML code? Shouldn't it be

Code:

<element attribute="fhasdf&gt;lsdgf">
?

Regards, Little John
I'm quite sure it's valid. &gt; is not needed because the > is inside the double quotes.

Re: PluckString command to extract string from another

Post by MachineCode »

Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.

Anyway, here's what I've been using, which works great. Just wish a single command could do it.

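Code:

Procedure.s PluckString(text$,lborder$,rborder$)
  Protected l, r, s, p$
  l=FindString(text$,lborder$,1)
  If l
    s=l+Len(lborder$)              ; First character after the left border
    r=FindString(text$,rborder$,s) ; Search for the right border only after the whole left border
    If r
      p$=Mid(text$,s,r-s)
    EndIf
  EndIf
  ProcedureReturn p$
EndProcedure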

n$=Str(Random(9999))

Debug "Plucking: "+n$
Debug PluckString("c"+n$+"d","c","d")
Debug PluckString("abc"+n$+"def","c","d")
Debug PluckString("dc"+n$+"dc","c","d")
Debug PluckString("<a href='hi'>"+n$+"</a>","<a href='hi'>","</a>")
Debug PluckString("<attribute='"+n$+"'>","<attribute='","'>")

Re: PluckString command to extract string from another

Post by Trond »

MachineCode wrote:
Trond wrote:<element attribute="fhasdf>lsdgf">
Not an issue at all, because the right "border" would be Chr(34)+">", not just ">" alone.
Now that's just making it worse. Consider:

Space between quote and >:

Code:

<element name="sdfkjha>sdf" >
HTML5 has many new attributes without parameters:

Code:

<audio id="asdfrq>we" src="a.avi" loop>
HTML allows single quotes around parameters:

Code:

<element name="asdfwe>ferf" id='squote'>
HTML allows omitting quotes around the parameter, as long as it contains only letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46):

Code:

<element name="owe>fi" src=hi.html>
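
To make the cases above concrete, here is a minimal sketch (an illustration, not code from the thread) of finding the ">" that really closes a tag while skipping any ">" inside single- or double-quoted attribute values. The name FindTagEnd and its parameters are hypothetical, and the sketch assumes the input is a single ordinary tag; it knows nothing about comments, scripts, or other HTML constructs.

```purebasic
; Hypothetical helper (not from the thread): returns the position of the ">" that
; closes the tag, ignoring any ">" inside single or double quotes.
; Returns 0 if no closing ">" is found.
Procedure.i FindTagEnd(html$, tagStart.i = 1)
  Protected i.i, c$, quote$ = ""

  If tagStart < 1
    tagStart = 1
  EndIf

  For i = tagStart To Len(html$)
    c$ = Mid(html$, i, 1)
    If quote$ <> ""
      If c$ = quote$           ; the matching quote closes the attribute value
        quote$ = ""
      EndIf
    ElseIf c$ = #DQUOTE$ Or c$ = "'"
      quote$ = c$              ; remember which kind of quote opened the value
    ElseIf c$ = ">"
      ProcedureReturn i
    EndIf
  Next

  ProcedureReturn 0
EndProcedure

tag$ = "<element name=" + #DQUOTE$ + "asdfwe>ferf" + #DQUOTE$ + " id='squote'> text"
Debug Mid(tag$, 2, FindTagEnd(tag$) - 2)   ; the tag's contents without the enclosing < and >
```

Because the quote state is tracked per character, a space before the closing ">", valueless attributes, and unquoted values need no special handling: none of them opens a quote.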