Parse and reformat strings

Michael Vogel · Post by **Michael Vogel** » Thu Aug 13, 2015 4:34 pm

I've written a (not so fast) tag remover which deletes text tags in two steps. It is used to format media information text, transforming "<song> by <interpret1>{ and }<interpret2>{ and }<interpret3>" into "the sound of silence by simon and garfunkel" or "amadeus by falco".

Now I want enhanced {}-tags whose text is kept or removed depending on what content is to the left and right of the braces.

{123_} should keep '123' when there is regular text directly after the closing brace (so it should do the same as '{123}')
{_123} should keep '123' if there is regular text before the opening brace
{_123_} should keep '123' only if there is regular text before AND after the braces
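
The three rules can be pinned down with a slow but readable reference sketch. It is not code from this thread: the names IsRegular and SlowTagRemove are made up, and it assumes that "regular text" means any character other than '{' and '}', and that tags are well formed (no nesting).

```purebasic
; Reference sketch only (hypothetical names, not from the thread).
; Assumption: "regular text" = any character except '{' and '}'.
Procedure.i IsRegular(ch.s)
  ProcedureReturn Bool(ch <> "" And ch <> "{" And ch <> "}")
EndProcedure

Procedure.s SlowTagRemove(text.s)
  Protected result.s, content.s, before.s, after.s
  Protected pos = 1, na, nb, keep, needAfter

  Repeat
    na = FindString(text, "{", pos)
    If na = 0 : Break : EndIf
    nb = FindString(text, "}", na)
    If nb = 0 : Break : EndIf

    If na > pos                                   ; text outside braces is always kept
      result + Mid(text, pos, na - pos)
    EndIf
    content = ""
    If nb > na + 1 : content = Mid(text, na + 1, nb - na - 1) : EndIf
    before = ""
    If na > 1 : before = Mid(text, na - 1, 1) : EndIf
    after = Mid(text, nb + 1, 1)                  ; empty at the end of the string

    keep = #True
    needAfter = #True                             ; '{123}' behaves like '{123_}'
    If Left(content, 1) = "_"                     ; '{_123}' or '{_123_}'
      content = Mid(content, 2)
      needAfter = #False
      If IsRegular(before) = #False : keep = #False : EndIf
    EndIf
    If Right(content, 1) = "_"                    ; '{123_}' or '{_123_}'
      content = Left(content, Len(content) - 1)
      needAfter = #True
    EndIf
    If needAfter And IsRegular(after) = #False : keep = #False : EndIf

    If keep : result + content : EndIf
    pos = nb + 1
  ForEver

  ProcedureReturn result + Mid(text, pos)         ; rest of the string after the last tag
EndProcedure

Debug SlowTagRemove("{_1}{_2}{_3} *A* {_1} *B* {_2}{_3}")      ; ' *A* 1 *B* 2'
Debug SlowTagRemove("{1_}{2_}{3_} *A* {1_} *B* {2_}{3_}")      ; '3 *A* 1 *B* '
Debug SlowTagRemove("{_1_}{_2_}{_3_} *A* {_1_} *B* {_2_}{_3_}"); ' *A* 1 *B* '
```

It builds the result with ordinary string functions, so it is only meant as a yardstick for checking a faster version.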

Any ideas for fast working code?

Code:

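Procedure.s OldTagRemove(text.s)
	
	Protected n,na,nb
	
	na=1
	Repeat
		na=FindString(text,"{",na); next '{' (the text only shrinks, so na stays valid)
		nb=FindString(text,"}",na); the '}' that closes it
		If na And nb
			If nb<Len(text)
				If Mid(text,nb+1,1)<>"{"
					; text follows the tag: keep the content, drop the braces
					text=Left(text,na-1)+Mid(text,na+1,nb-na-1)+Mid(text,nb+1)
				Else
					; another tag follows directly: drop the whole tag
					text=Left(text,na-1)+Mid(text,nb+1)
				EndIf
			Else
				; tag at the very end of the string: drop the whole tag
				text=Left(text,na-1)+Mid(text,nb+1)
			EndIf
		Else
			n=#True; no complete tag left
		EndIf
	Until n

	ProcedureReturn text
	
EndProcedure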
Procedure.s NewTagRemove(text.s)
	
	ProcedureReturn "n/a"
	
EndProcedure

Debug OldTagRemove("{1}{2}{3} *A* {1} *B* {2}{3}")

Debug NewTagRemove("{_1}{_2}{_3} *A* {_1} *B* {_2}{_3}"); should be ' *A* 1 *B* 2'
Debug NewTagRemove("{1_}{2_}{3_} *A* {1_} *B* {2_}{3_}"); should be '3 *A* 1 *B* '
Debug NewTagRemove("{_1_}{_2_}{_3_} *A* {_1_} *B* {_2_}{_3_}"); should be ' *A* 1 *B* '

RichAlgeni · Post by **RichAlgeni** » Fri Aug 14, 2015 3:35 am

Do a search for threads by members Helle and Wilbert; this will get you a number of assembler routines for searching strings. I don't want to shortchange anyone: there are good contributions to string handlers by a number of people.

wilbert · Post by **wilbert** » Fri Aug 14, 2015 4:54 am

Assembler can certainly make things faster, but the most important thing to start with is to use pointers rather than so many PB string functions.
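
As a rough illustration of the pointer approach (a sketch with a made-up name, StripBraces, not the routine asked for), the following walks the string once with a read pointer and a write pointer. It compacts the text in place instead of building a new string with Left() and Mid() for every tag. Here it simply drops every brace.

```purebasic
; Sketch: single pass with a read pointer and a write pointer.
; text.s is passed by value, so the procedure works on its own copy.
Procedure.s StripBraces(text.s)
  Protected *src.Character = @text   ; read position
  Protected *dst.Character = @text   ; write position, never ahead of *src

  If *src = 0 : ProcedureReturn "" : EndIf

  While *src\c
    If *src\c <> '{' And *src\c <> '}'
      *dst\c = *src\c                ; keep this character
      *dst + SizeOf(Character)
    EndIf
    *src + SizeOf(Character)
  Wend
  *dst\c = 0                         ; terminate the shortened string

  ProcedureReturn text
EndProcedure

Debug StripBraces("{1} *A* {2}")     ; 1 *A* 2
```

Stepping by SizeOf(Character) keeps the loop valid in both ASCII and Unicode builds.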

Michael Vogel wrote: {123_} should keep '123' when there is regular text directly after the closing brace (so it should do the same as '{123}')

Which characters do you consider to be regular text?
What if a space occurs within the brackets? (like {1 23 _ })

Michael Vogel · Post by **Michael Vogel** » Fri Aug 14, 2015 6:39 am

wilbert wrote: Assembler can certainly make things faster, but the most important thing to start with is to use pointers rather than so many PB string functions.

Michael Vogel wrote: {123_} should keep '123' when there is regular text directly after the closing brace (so it should do the same as '{123}')
Which characters do you consider to be regular text?
What if a space occurs within the brackets? (like {1 23 _ })

I would say all characters other than '{' and '}' (possibly '_' as well) should count as regular text.
Spaces within brackets are handled like all other characters; my old version converted "<image name>{ - }<exif info>{ - }<exif title>" to things like "NAME", "NAME - INFO", "NAME - TITLE" and "NAME - INFO - TITLE", depending on what content was present in the <>-tags.

I was thinking of adding a temporary '}' in front of the string and a temporary '{' at its end, then first replacing the content of the following bracket combinations with nothing: '}{_...}', '{..._}{' (and for compatibility '{...}{') and '}{_..._}{' (the dots represent any characters). After that I would remove all '{_}', then all '{' and finally all '}'.
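
A possible sketch of this sentinel idea, using PureBasic's regular expression library in place of a hand-written scanner (that swap is mine, not part of the plan above). SentinelTagRemove is a hypothetical name, and the sketch assumes well-formed, non-nested tags. Rejected tags are emptied to '{}' rather than deleted, so that neighbouring tags still see a brace next to them.

```purebasic
; Sketch only: regular expressions stand in for the hand-written state machine.
Procedure.s SentinelTagRemove(text.s)
  ; Sentinels: "start of string" now looks like a closing brace and
  ; "end of string" looks like an opening brace.
  Protected w.s = "}" + text + "{"

  ; 1) '{_...}' directly after '}' has no regular text before it: empty it
  If CreateRegularExpression(0, "(?<=\})\{_[^{}]*\}")
    w = ReplaceRegularExpression(0, w, "{}")
    FreeRegularExpression(0)
  EndIf

  ; 2) '{...}' and '{..._}' directly before '{' have no regular text after: empty them
  ;    ('{_...}' without a trailing '_' is not matched, it only needs text before it)
  If CreateRegularExpression(0, "\{(?:[^{}_][^{}]*|[^{}]*_)?\}(?=\{)")
    w = ReplaceRegularExpression(0, w, "{}")
    FreeRegularExpression(0)
  EndIf

  ; 3) drop the '_' markers, then all braces (this also removes the sentinels)
  w = ReplaceString(w, "{_", "{")
  w = ReplaceString(w, "_}", "}")
  w = ReplaceString(w, "{", "")
  ProcedureReturn ReplaceString(w, "}", "")
EndProcedure

Debug SentinelTagRemove("{_1}{_2}{_3} *A* {_1} *B* {_2}{_3}")      ; ' *A* 1 *B* 2'
Debug SentinelTagRemove("{_1_}{_2_}{_3_} *A* {_1_} *B* {_2_}{_3_}"); ' *A* 1 *B* '
```

This is unlikely to be the fastest variant, but it keeps each of the bracket combinations visible as one pattern.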

My first attempt handles the '{..._}' case only, and it is not easy to change the routine to handle '{...}' identically, nor to deal with '{_...}' and '{_..._}'...

Code:

Procedure.s NewTagRemove(text.s)
	
	Enumeration 
		#StateNil
		#StateText
	EndEnumeration 
	
	state=#StateNil
	
	text="}"+text+"{"
	Debug text
	Debug "-------"
	
	n=Len(text)
	While n
		b=PeekA(@text+n-1)
		Select b
			Case '{'
				cclose=n
				If copen=ctype+1
					If state=#StateText
						text=Left(text,cclose-1)+Mid(text,cclose+1,ctype-cclose-1)+Mid(text,copen+1)
						Debug "A: "+text
					Else
						text=Left(text,cclose)+Mid(text,copen)
						Debug "B: "+text
					EndIf
				EndIf
			Case '_'
				ctype=n
			Case '}'
				copen=n
				If copen=cclose-1
					state=#StateNil
				Else
					state=#StateText
				EndIf
			Default
				creg=n
		EndSelect
		n-1
		
	Wend
	
	ProcedureReturn ReplaceString(ReplaceString(text,"{",""),"}","")
	
EndProcedure

;Debug OldTagRemove("{1}{2}{3} *A* {1} *B* {2}{3}")
;Debug NewTagRemove("{_1}{_2}{_3} *A* {_1} *B* {_2}{_3}")
Debug NewTagRemove("{1_}{2_}{3_} *A* {1_} *B* {2_}{3_}")
;Debug NewTagRemove("{_1_}{_2_}{_3_} *A* {_1_} *B* {_2_}{_3_}")

wilbert · Post by **wilbert** » Fri Aug 14, 2015 8:44 am

Here's my attempt.
It seems to work but I'm not 100% sure. I also didn't check the speed.

Code:

Procedure.s OldTagRemove(text.s)
  
  Protected n,na,nb
  
  na=1
  Repeat
    na=FindString(text,"{",na)
    nb=FindString(text,"}",na)
    If na And nb
      If nb<Len(text)
        If Mid(text,nb+1,1)<>"{"
          text=Left(text,na-1)+Mid(text,na+1,nb-na-1)+Mid(text,nb+1)
        Else
          text=Left(text,na-1)+Mid(text,nb+1)
        EndIf
      Else
        text=Left(text,na-1)+Mid(text,nb+1)
      EndIf
    Else
      n=#True
    EndIf
  Until n
  
  ProcedureReturn text
  
EndProcedure



; *** NewTagRemove code ***

#SoC = SizeOf(Character)

Structure CharArray
  i.c[0]
EndStructure

Procedure.i FindChar(*s.Character, char.u)
  !movzx ecx, word [p.v_char]
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64  
    !mov rax, [p.p_s]
    !test rax, rax
    !jz findchar_exit
    CompilerIf #PB_Compiler_Unicode
      !sub rax, 2
      !findchar_loop:
      !add rax, 2
      !movzx edx, word [rax]
    CompilerElse
      !sub rax, 1
      !findchar_loop:
      !add rax, 1
      !movzx edx, byte [rax]
    CompilerEndIf
  CompilerElse
    !mov eax, [p.p_s]
    !test eax, eax
    !jz findchar_exit    
    CompilerIf #PB_Compiler_Unicode
      !sub eax, 2
      !findchar_loop:
      !add eax, 2
      !movzx edx, word [eax]
    CompilerElse
      !sub eax, 1
      !findchar_loop:
      !add eax, 1
      !movzx edx, byte [eax]
    CompilerEndIf    
  CompilerEndIf
  !cmp edx, ecx
  !je findchar_exit
  !test edx, edx
  !jnz findchar_loop
  !xor eax, eax
  !findchar_exit:
  ProcedureReturn
EndProcedure

Procedure.s NewTagRemove(text.s)
  
  Protected.CharArray *c0_, *c1_, *c0, *c1 = @text
  Protected result.s = text, *r.Character = @result
  Protected keep
  
  *c0 = FindChar(*c1, '{')
  *c1 = FindChar(*c0, '}')
  
  If *c1 = 0
    ProcedureReturn result
  Else
    *r + *c0 - @text
  EndIf
    
  While *c1 > *c0
    
    *c0_ = *c0
    *c1_ = *c1
    
    keep = #True
    If *c0\i[1] = '_'
      ; process {_ ...
      If *c0 <> @text And *c0\i[-1] <> '}'
        *c0_ + #SoC
      Else
        keep = #False
      EndIf
    EndIf
    If *c1\i[-1] = '_' Or *c0\i[1] <> '_'
      ; process ... _} and {...}
      If *c1\i[1] And *c1\i[1] <> '{'
        If *c1\i[-1] = '_'
          *c1_ - #SoC
        EndIf
      Else
        keep = #False
      EndIf
    EndIf
    
    If keep
      *c0_ + #SoC
      If *c1_ > *c0_
        ; copy only the tag content; '{}' and '{_}' have nothing to copy
        CopyMemory(*c0_, *r, *c1_ - *c0_)
        *r + *c1_ - *c0_
      EndIf
    EndIf
    
    *c0 = FindChar(*c1, '{')
    *c1 + #SoC
    If *c0 = 0
      *c0 = FindChar(*c1, 0)
    EndIf
    CopyMemory(*c1, *r, *c0 - *c1)
    *r + *c0 - *c1
    *c1 = FindChar(*c0, '}')
  Wend
  
  *r\c = 0
  ProcedureReturn result
  
EndProcedure

Debug OldTagRemove("{1}{2}{3} *A* {1} *B* {2}{3}")
Debug NewTagRemove("{1}{2}{3} *A* {1} *B* {2}{3}")

Debug NewTagRemove("{_1}{_2}{_3} *A* {_1} *B* {_2}{_3}"); should be ' *A* 1 *B* 2'
Debug NewTagRemove("{1_}{2_}{3_} *A* {1_} *B* {2_}{3_}"); should be '3 *A* 1 *B* '
Debug NewTagRemove("{_1_}{_2_}{_3_} *A* {_1_} *B* {_2_}{_3_}"); should be ' *A* 1 *B* '
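
For readers who do not read x86 assembler: as far as I can tell, FindChar behaves like the plain PureBasic loop below. This is a sketch under that reading; FindCharPB is a made-up name. It returns a pointer to the first occurrence of char, a pointer to the terminating null when char is 0, and 0 when the character is not found or *s is null.

```purebasic
; Plain PureBasic reading of FindChar (hypothetical name FindCharPB).
Procedure.i FindCharPB(*s.Character, char.c)
  If *s = 0
    ProcedureReturn 0
  EndIf
  While *s\c <> char
    If *s\c = 0
      ProcedureReturn 0          ; hit the terminator without a match
    EndIf
    *s + SizeOf(Character)       ; next character (1 or 2 bytes)
  Wend
  ProcedureReturn *s             ; address of the match
EndProcedure

Define text.s = "ab{cd}"
Define *p = FindCharPB(@text, '{')
Debug (*p - @text) / SizeOf(Character)   ; character offset of '{' (2)
```

The assembler version does the same comparison and null test per character, just without the overhead of PureBasic's generated code for the loop.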

Michael Vogel · Post by **Michael Vogel** » Fri Aug 14, 2015 1:21 pm

Great, works fine. I only saw a small issue with trailing characters, as in "{abc}xyz", so I added some lines at the end which deal with that as well. I will do some further testing and then use your code in my program. Wilbert, I have to thank you (once again).

Code:

#SoC=SizeOf(Character)

Structure CharArray
	i.c[0]
EndStructure

Procedure.i FindChar(*s.Character,char.u)

	!movzx ecx, word [p.v_char]
	CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
		!mov rax, [p.p_s]
		!test rax, rax
		!jz findchar_exit
		CompilerIf #PB_Compiler_Unicode
			!sub rax, 2
			!findchar_loop:
			!add rax, 2
			!movzx edx, word [rax]
		CompilerElse
			!sub rax, 1
			!findchar_loop:
			!add rax, 1
			!movzx edx, byte [rax]
		CompilerEndIf

	CompilerElse
		!mov eax, [p.p_s]
		!test eax, eax
		!jz findchar_exit
		CompilerIf #PB_Compiler_Unicode
			!sub eax, 2
			!findchar_loop:
			!add eax, 2
			!movzx edx, word [eax]
		CompilerElse
			!sub eax, 1
			!findchar_loop:
			!add eax, 1
			!movzx edx, byte [eax]
		CompilerEndIf
	CompilerEndIf

	!cmp edx, ecx
	!je findchar_exit
	!test edx, edx
	!jnz findchar_loop
	!xor eax, eax
	!findchar_exit:

	ProcedureReturn

EndProcedure
Procedure.s NewTagRemove(text.s)

	Protected.CharArray *c0_,*c1_,*c0,*c1
	Protected result.s
	Protected *r.Character
	Protected keep

	result=text

	*c1=@text
	*r=@result

	*c0=FindChar(*c1,'{');				position of (first) '{'
	If *c0
		*c1=FindChar(*c0,'}');			position of '}' (after the '{')
	Else
		ProcedureReturn result;			no '{' present...
	EndIf
	
	If *c1
		*r+*c0-@text;					keep text before '{' unchanged
	Else
		ProcedureReturn result;			no '}' present...
	EndIf

	While *c1>*c0

		*c0_=*c0;						position of '{', then the first character after '{'
		*c1_=*c1;						position of '}', then the last character before '}'

		keep=#True
		If *c0\i[1]='_';					process '{_...}'
			If *c0>@text And *c0\i[-1]<>'}'; not the first '{...}' and not '}{...}'
				*c0_+#SoC;				set left position after '_' character (*)
			Else
				keep=#False;			otherwise: remove tag content
			EndIf
		EndIf
		If *c1\i[-1]='_' Or *c0\i[1]<>'_';	process '{..._}' and '{...}'
			If *c1\i[1] And *c1\i[1]<>'{';	not the last '{...}' and not '{...}{'
				If *c1\i[-1]='_'
					*c1_-#SoC;			set right position before '_' character (*)
				EndIf
			Else
				keep=#False;			otherwise: remove tag content
			EndIf
		EndIf

		If keep
			*c0_+#SoC;					set position marker (*)
			If *c1_>*c0_;				nothing to copy for '{}' or '{_}'
				CopyMemory(*c0_,*r,*c1_-*c0_)
				*r +*c1_-*c0_;			adapt result pointer
			EndIf
		EndIf

		*c0=FindChar(*c1,'{');			next '{'...
		If *c0
			*c1+#SoC					
			CopyMemory(*c1,*r,*c0-*c1); fill characters (before next '{')
			*r+*c0-*c1
			*c1=FindChar(*c0,'}')
		Else
			*c1+#SoC					
			keep=Len(text)+@text-*c1
			CopyMemory(*c1,*r,keep); 	fill ending characters (behind last '}')
			*r+keep
			*c1=#Null
		EndIf
	Wend

	*r\c=0

	ProcedureReturn result

EndProcedure

Debug NewTagRemove("{1}{2}{3} *A* {4} *B* {5}{6}")
Debug NewTagRemove("{1_}{2_}{3_} *A* {4_} *B* {5_}{6_}")
Debug NewTagRemove("{_1}{_2}{_3} *A* {_4} *B* {_5}{_6}")
Debug NewTagRemove("{_}{_}{_} *A* {_} *B* {_}{_}")
Debug NewTagRemove("{_1_}{_2_}{_3_} *A* {_4_} *B* {_5_}{_6_}")

Debug NewTagRemove("Test 1{ (_}yep{_)}")
Debug NewTagRemove("Test 2{ (_}{_)}")
Debug NewTagRemove("{abc}trailing")

Debug NewTagRemove("Illegal 1{{{1}{{{2")
Debug NewTagRemove("Illegal 2}}}{1{}}}2")
Debug NewTagRemove("Illegal 3{_a{b{}")

wilbert · Post by **wilbert** » Fri Aug 14, 2015 1:37 pm

Michael Vogel wrote: Great, works fine. I only saw a small issue with trailing characters, as in "{abc}xyz", so I added some lines at the end which deal with that as well. I will do some further testing and then use your code in my program.

I know, I noticed the same problem and had already corrected it.

Maybe I should have added an extra post to let you know since you now had a version that wasn't the most recent one.
The fix you made yourself works fine on OS X in ASCII mode but doesn't seem to work when Unicode is enabled.

Michael Vogel · Post by **Michael Vogel** » Sat Aug 15, 2015 9:45 am

wilbert wrote: Maybe I should have added an extra post to let you know since you now had a version that wasn't the most recent one.
The fix you made yourself works fine on OS X in ASCII mode but doesn't seem to work when Unicode is enabled.

Len(text) must be the problem in my code: it counts characters, not bytes. Changing the appropriate line to keep=Len(text)*#SoC+@text-*c1 should make it run correctly. Thanks for your cool help; meanwhile I am using tons of your (assembler) code in my program.

PS: the Len(text)*#SoC means slower execution in ASCII mode as well, since the compiler doesn't eliminate multiplications by one.

wilbert · Post by **wilbert** » Sat Aug 15, 2015 9:57 am

Michael Vogel wrote: Len(text) must be the problem in my code: it counts characters, not bytes. Changing the appropriate line to keep=Len(text)*#SoC+@text-*c1 should make it run correctly. Thanks for your cool help; meanwhile I am using tons of your (assembler) code in my program.

If Len is the problem, you could try StringByteLength. That way you don't need the multiplication.
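
A small sketch of the difference, using a made-up example string and a hypothetical pointer *tail that plays the role of *c1 after the last '}'. Both Debug lines should print the same byte count in ASCII and in Unicode builds.

```purebasic
; Sketch: bytes remaining behind the last '}' computed in two ways.
Define text.s = "{abc}xyz"
Define *tail = @text + 5 * SizeOf(Character)          ; points at "xyz"

Debug Len(text) * SizeOf(Character) + @text - *tail   ; characters scaled to bytes
Debug StringByteLength(text) + @text - *tail          ; byte length directly
```

With that, the line in question could read keep=StringByteLength(text)+@text-*c1.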
