
Posted: Thu Jan 29, 2004 3:01 pm
by naw
wow! Brilliant - I bet there's only a Lib for Windows, though, Linux would be very nice. (I can't check reelmedia because the reelmedia server is refusing connections again)

Posted: Thu Jan 29, 2004 5:04 pm
by FloHimself
Kale wrote: That's exactly what I was pointing at earlier on in this thread. 8)
I know, but does naw? ;)
naw wrote:wow! Brilliant - I bet there's only a Lib for Windows, though, Linux would be very nice.
Yes it is a Windows Lib. I couldn't compile it for linux, because i've no linux system running atm...
naw wrote:(I can't check reelmedia because the reelmedia server is refusing connections again)
Try: http://www.florian-s.com/download/PureBasic/

Posted: Thu Jan 29, 2004 5:05 pm
by tinman
naw wrote:wow! Brilliant - I bet there's only a Lib for Windows, though, Linux would be very nice. (I can't check reelmedia because the reelmedia server is refusing connections again)
There's PCRE (Perl Compatible Regular Expressions), which is freely available under a permissive open-source licence and has been ported as both static and dynamic libraries for many platforms. If you can understand the API (and there's not much to it, maybe 6 or 8 functions you need to use) then you should be able to use them in PureBasic using either the Library library or using the DLL importer.

Posted: Thu Jan 29, 2004 6:19 pm
by naw
- more easily said than done Tinman - I'm only a casual programmer so building a Linux library is really a little beyond my ability...

Posted: Fri Jan 30, 2004 2:15 pm
by blueznl
mmmm a bit more work than i expected :-)

and some things i don't get (yet) about the syntax...

[A-Z]* means any char from A to Z any number of times?
[A-Z]+ means same but at least one time?

Posted: Fri Jan 30, 2004 2:26 pm
by naw
That's it Blueznl - you've got the idea!!!
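
To make the difference between the two quantifiers concrete, here is a small sketch. It uses Python's `re` module purely because it is easy to run; that is an assumption for illustration, not the RegExpr library from this thread, and the `whole` helper is a hypothetical name.

```python
import re

# '*' allows zero or more repetitions, '+' requires at least one.
star = re.compile(r"[A-Z]*")
plus = re.compile(r"[A-Z]+")

def whole(pattern, text):
    """Return True if the compiled pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

print(whole(star, ""))     # empty string: zero repetitions are fine for '*'
print(whole(plus, ""))     # empty string: '+' needs at least one character
print(whole(star, "ABC"))  # three characters from A to Z
print(whole(plus, "ABC"))
print(whole(plus, "AbC"))  # 'b' is outside the range A-Z
```

The only input on which the two patterns disagree is the empty string.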

Posted: Fri Jan 30, 2004 2:38 pm
by blueznl
how to include a '[' then? [[] ?

Posted: Fri Jan 30, 2004 2:57 pm
by naw
I've started looking at the RegExpr Library, but the example given below is a bit of a mind blower for the uninitiated...

I can't begin to make sense of it - I was kind of hoping for syntax like:

result$=RegExprReplace(string$,regexpr$)
position=RegExprFind(string$,regexpr$)

- oh well...

Code:

; RegCompEx Test
*compiled.REGEXP
Debug RegCompEx(@*compiled, "(<TITLE>|<title>)(.*)(</TITLE>|</title>)")

; RegErrorEx Test
error$ = Space(80)
PeekS(RegErrorEx(#REGEXP_ESPACE, *compiled, error$, 80))
Debug error$

; RegNSubExpEx Test
Debug "Number of SubExpressions: " + Str(RegNSubExpEx(*compiled))

; RegExecEx Test
Test$ = "<HTML><HEAD><TITLE>PureBasic : visual basic compiler, easy & optimized basic programming language, basic, compiler</TITLE></HEAD><BODY></BODY></HTML>"
Dim test.REGMATCH(RegNSubExpEx(*compiled))  

Debug RegExecEx(*compiled, Test$, RegNSubExpEx(*compiled), @test(0))

Debug PeekS(@Test$ + test(0)\subexp_begin) ; Test REGMATCH offset
Debug PeekS(@Test$ + test(1)\subexp_begin)
Debug PeekS(@Test$ + test(2)\subexp_begin)
Debug PeekS(@Test$ + test(3)\subexp_begin)
Debug PeekS(@Test$ + test(4)\subexp_begin)

; RegSubEx Test
*buffer = AllocateMemory(1, 200, 0) 
Debug RegSubEx(*compiled, Test$, "\2", @*buffer)
Debug "The Buffer contains: " + PeekS(*buffer)

; RegFreeEx Test
RegFreeEx(*compiled)
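
For readers who find the test listing hard to follow, the same sequence of steps (compile, count subexpressions, execute, read the match offsets, substitute `\2`) can be sketched with Python's `re` module. This is only an analogy: the mapping from the RegExpr functions to these Python calls is an assumption, the `\2` step assumes RegSubEx expands a template the way a classic `regsub` does, and the test string is shortened.

```python
import re

# Same pattern as in the PureBasic test: three parenthesised subexpressions.
pattern = re.compile(r"(<TITLE>|<title>)(.*)(</TITLE>|</title>)")

html = ("<HTML><HEAD><TITLE>PureBasic : visual basic compiler</TITLE>"
        "</HEAD><BODY></BODY></HTML>")

# Counterpart of asking how many subexpressions the pattern has.
print("Number of subexpressions:", pattern.groups)

# Counterpart of executing the compiled pattern against the string.
match = pattern.search(html)
if match:
    # Offset 0 is the whole match, 1..3 are the subexpressions.
    for index in range(pattern.groups + 1):
        print(index, match.start(index), repr(match.group(index)))

    # Counterpart of the "\2" substitution: keep only the second subexpression.
    print("Title:", match.expand(r"\2"))
```

The text between the tags is subexpression 2, which is why the original test substitutes `\2`.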

Posted: Fri Jan 30, 2004 3:18 pm
by blueznl
naw, plz tell me what the following do:


abc\.
abc\\
abc.+
abc.*
abc.[

Posted: Fri Jan 30, 2004 5:54 pm
by naw

Code:

abc\.     "." is a special character in RE meaning repetitions of the previous character - so "A." will match "AA" but not "AB". The "\" escapes the special meaning of the next character which is "." so effectively "abc\." matches "abc."

abc\\     "\" = escape, so "abc\\" matches "abc\"
abc.+     - sorry don't know what "+" means - never used it...
abc.*     will match "abccccc" or "abcccdd234234" but not "abdefg"
abc.[      is a badly formed RE - ie "abc.[e-z]" would match "abccd" or "abccccce" or "abccccccccx" 

if you wanted to match "abc.[" you would have to use "abc\.\["
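
For comparison, here is a minimal sketch of how one widely used implementation, Python's `re` module, treats the five patterns. Python is used only because it is easy to run; it is not the RegExpr library discussed in this thread, other flavours may differ in details, and the `matches` helper is a hypothetical name. In this flavour an unescaped "." stands for any single character, and "+" means one or more of the preceding item, so the results differ from the description above.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# abc\.  : the backslash escapes the dot, so only a literal "." is accepted
print(matches(r"abc\.", "abc."), matches(r"abc\.", "abcd"))

# abc\\  : an escaped backslash, so the text must end in a real "\"
print(matches(r"abc\\", "abc\\"))

# abc.+  : "abc" followed by one or more characters of any kind
print(matches(r"abc.+", "abcd"), matches(r"abc.+", "abc"))

# abc.*  : "abc" followed by zero or more characters of any kind
print(matches(r"abc.*", "abc"), matches(r"abc.*", "abdefg"))

# abc.[  : the bracket is never closed, so the pattern is rejected
try:
    re.compile(r"abc.[")
except re.error as exc:
    print("invalid pattern:", exc)

# abc\.\[ : both special characters escaped, matches the literal text "abc.["
print(matches(r"abc\.\[", "abc.["))

# An unescaped dot is one arbitrary character, not a repetition.
print(matches(r"A.", "AB"), matches(r"A.", "AA"))
```

As for the earlier question about a literal "[": escaping it as `\[` works here, and POSIX bracket expressions also accept `[[]`.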

Posted: Sat Jan 31, 2004 12:09 am
by blueznl
is this valid?

"this is a test"

is matched by

"th.*a test"

pretty nasty, by the way, full support of patterns could lead to massive iterations... i think the following pseudo code should do it, still have to code it though :-)

Code:

  ; concept
  ;
  ; take pattern apart, split it up in blocks
  ; per block: type (0 exact match to a number of chars, 1 fancy stuff)
  ; per block: min (0 or 1) and max (1 or n) characters
  ; put this stuff in a table
  ; and now the real stuff... l = len(string)
  ; startpos(1) = 1,  endpos(1) = l
  ; n = 1
  ;
  ; again = false
  ; repeat
  ;   try to match block(n) *as far away as possible* aka. up to endpos(n)
  ;   if match
  ;     p = found pos (last character of match, in range startpos(n) to endpos(n))
  ;     endpos(n) = p
  ;     inc n
  ;     if n< nr of blocks
  ;       startpos(n) = p+1
  ;       endpos(n) = l
  ;       again = true
  ;     endif
  ;   else
  ;     no match, damn
  ;     dec n
  ;     if n>1
  ;       endpos(n) = endpos(n)-1
  ;       again = true
  ;     endif
  ;   endif
  ; until again = false
  ;
  ; if n < 1 no match
  ; if n > nr of blocks and endpos(nr of blocks) = l then there is a match
  

some speed improvement is possible by doing a walk through first and detect startpos / endpos of some blocks... hmmm... better first code this, but now it's bedtime
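
The "match as far away as possible, then back off" idea can be tried out in a few lines. The following is a minimal sketch in Python, not the table-driven design from the concept above: it is a recursive variant that assumes a pattern made only of literal characters, "." and "x*", and must match the whole string. The name `match_here` is hypothetical.

```python
def match_here(pattern, text):
    """Return True if pattern matches all of text.

    Supported syntax (an assumption of this sketch): literal characters,
    '.' for any single character, and 'x*' for zero or more of x.
    """
    if not pattern:
        # Pattern used up: success only if the text is used up as well.
        return not text

    if len(pattern) >= 2 and pattern[1] == "*":
        # Find how far the starred block could stretch at most.
        longest = 0
        while longest < len(text) and pattern[0] in (".", text[longest]):
            longest += 1
        # Try the longest stretch first, then give back one character
        # at a time until the rest of the pattern fits (backtracking).
        for taken in range(longest, -1, -1):
            if match_here(pattern[2:], text[taken:]):
                return True
        return False

    # Plain block: one literal character or '.'.
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


print(match_here("th.*a test", "this is a test"))  # the example from this post
print(match_here("th.*a test", "this is a text"))
```

Each starred block retries at most once per character it could cover, which is where the "massive iterations" come from once several such blocks follow one another.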

Posted: Sat Jan 31, 2004 12:03 pm
by blueznl
there's a little thing unclear about the dot:
"." Matches any single character except newline. In awk, dot can match newline also.
naw wrote: abc\. "." is a special character in RE meaning repetitions of the previous character - so "A." will match "AA" but not "AB". The "\" escapes the special meaning of the next character which is "." so effectively "abc\." matches "abc."
so, what is it now?

Posted: Sat Jan 31, 2004 2:47 pm
by FloHimself
"." Matches any single character except newline. In awk, dot can match newline also.
that's correct. many unix programs deal with regexps, like grep, sed, awk, vi, some shells.. every one has special metacharacters and some are implemented differently. so it's up to you to decide which implementation you will follow..
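
As one concrete case of an implementation choosing its own behaviour, Python's `re` module (used here only as a runnable illustration, not as the library from this thread) excludes newline from "." by default and includes it when the `re.DOTALL` flag is set:

```python
import re

text = "a\nb"  # 'a', a newline, 'b'

# Default behaviour: '.' matches any single character except newline.
default = re.fullmatch(r"a.b", text)

# With DOTALL the dot also matches the newline, like the awk behaviour quoted above.
dotall = re.fullmatch(r"a.b", text, flags=re.DOTALL)

print(default is None)      # no match without the flag
print(dotall is not None)   # match with the flag
```

Anyone writing their own matcher has to pick one of the two behaviours, or offer a switch like this.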

Posted: Sat Jan 31, 2004 3:03 pm
by blueznl
ok... now, another question :-)

is this valid? [ABC|CDE|X*]*

that's *very* nasty to code :-)

*edit*

hmmm... looking at the descriptions i can find on the net, there are indeed different variations :-)

ok, the following appears to be valid:

(abd|cde) which means that segment has either to match abd or cde, forcing me into a two dimensional array on my current approach

the following i haven't seen so i assume it isn't valid, or is it?

(abc|[a-z])

this is nasty to code, as backtracking is almost impossible as for every possibility i would have to backtrack, oh it can be done, recursive, but it can take ages to resolve

an alternative approach would be to check the match on the 'known' (fixed) segments, then try to fix the variable ones in between... brrr... what did i start...

Posted: Sat Jan 31, 2004 3:59 pm
by FloHimself
blueznl wrote: is this valid? [ABC|CDE|X*]*
yes it is. it matches any expression, because the last asterisk (*) means: the expression [ABC|CDE|X*] has to appear "0" or "n" times.

Edit:
blueznl wrote: the following i haven't seen so i assume it isn't valid, or is it?

(abc|[a-z])
sure this is valid, too. this is a very simple expression: matching "abc" or any lowercase character "a", "b", "c",...,"z"
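
One detail that matters when coding this: in most flavours, square brackets define a set of single characters, and "|" and "*" lose their special meaning inside them. The sketch below shows this with Python's `re` module, chosen only because it is easy to run; it is not the RegExpr library, and the grouped pattern `(ABC|CDE|X)*` is my own assumption of what an alternation version might look like.

```python
import re

# Brackets: a set containing the characters A, B, C, |, D, E, X and *.
char_class = re.compile(r"[ABC|CDE|X*]*")

# Parentheses: real alternation between whole alternatives.
grouped = re.compile(r"(ABC|CDE|X)*")
either = re.compile(r"(abc|[a-z])")

print(char_class.fullmatch("A|X*") is not None)   # '|' and '*' are ordinary set members
print(char_class.fullmatch("AB") is not None)     # any mix of the set's characters
print(grouped.fullmatch("AB") is not None)        # "AB" is not one of the alternatives
print(grouped.fullmatch("ABCCDEXX") is not None)  # ABC, CDE, X, X

# The trailing '*' allows zero repetitions, so a search succeeds on any input.
print(repr(char_class.search("xyz").group(0)))    # empty match at the start

print(either.fullmatch("abc") is not None)        # first alternative
print(either.fullmatch("q") is not None)          # second alternative, one letter
print(either.fullmatch("ab") is not None)         # neither alternative fits
```

So the bracketed form needs no two-dimensional table at all, while the parenthesised forms are the ones that require trying each alternative in turn.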

if fred would send me the pb linux version i could compile my regexp lib for linux. but he hasn't replied yet..