Yet another Regular Expression tester [adapted for Unicode]

Share your advanced PureBasic knowledge/code with the community.
User avatar
fiver
User
User
Posts: 36
Joined: Wed May 05, 2004 8:21 pm
Location: An outer spiral arm of the Milky Way

Yet another Regular Expression tester [adapted for Unicode]

Post by fiver »

While working on a script parser that involves a lot of regex I came across kenmo's Regex Tester which I have found makes life a lot easier :wink:
I have updated it a little adding a basic save facility, file load and scintilla gadgets (mostly for highlighting the matches but also for line numbering), code below, have fun!

EDIT:
Altered to support unicode and removed the second superfluous scintilla gadget, see the revised code below. This topic: Scintilla findtext and unicode proved very helpful particularly ts-soft's include and srod's comment about the underlying buffer vs the visible characters. This code is still a bit rough around the edges, files without proper byte order marks may cause problems, you can use a hex editor to correct these or run the file through the PB IDE and save it as utf8.

EDIT:
As pointed out by Danilo this won't work in unicode mode as is, because the memory allocation for the Scintilla gadgets is ascii only in it's current form.

EDIT:
Should now work on Windows and Linux
  • Fixed error pointed out by IdeasVacuum
  • Added code for Scintilla gadgets on windows, tested on XP W32

Image

Code: Select all

; Based largely on Regex Tester by kenmo, see:
; http://www.purebasic.fr/english/viewtopic.php?f=12&t=42839
; Tweaked by Fiver added:
; Scintilla gadgets, mostly for line numbers / highlighting matches
; Fairly basic file based save and load regex facilities
; Open file into sample gadget
; Added unicode support

;-- Or not, disable if you just want ascii 
CompilerIf #PB_Compiler_Unicode = 0
CompilerError "Whoa there crazy horse, compile with unicode mode enabled...."
CompilerEndIf

Global StringRegex, Save, Sample
Global WindowMain, WindowLibrary, ListIconRegex

; Copy the Scintilla.dll from the PB compilers directory into the same directory as this code
CompilerSelect #PB_Compiler_OS
CompilerCase #PB_OS_Windows
InitScintilla("Scintilla.dll")
CompilerEndSelect

IncludeFile "RegExTesterProcs.pb"
;--IncludeFile "SciProcTest.pb"
; Set default font
; Monaco is a really nice fixed width font, it should be available on osx, for
; Linux and Windows get it from:
; http://goo.gl/DIVgS
If LoadFont(1, "Monaco", 10)
  SetGadgetFont(#PB_Default, FontID(1))
EndIf

Flags = #PB_Window_MinimizeGadget|#PB_Window_ScreenCentered

WindowMain = OpenWindow(#PB_Any, 0, 0, 800, 600, "Regex Tester", Flags)

TextGadget(0, 15, 13, 100, 25, "Regex:", #PB_Text_Right)
StringRegex = StringGadget(#PB_Any, 120, 10, 535, 30, "[a-zA-Z]+")
ButtonSave = ButtonGadget(#PB_Any, 665, 10, 60, 30, "Save")
ButtonOpen = ButtonGadget(#PB_Any, 730, 10, 60, 30, "Open")
TextGadget(4, 15, 55, 100, 25, "Matches:", #PB_Text_Right)
TextGadget(5, 120, 55, 100, 25, "", #PB_Text_Center)
TextGadget(2, 15, 85, 100, 25, "Test Text:", #PB_Text_Right)
Sample = ScintillaGadget(#PB_Any, 10, 110, 780, 480, @ScintillaCallBack())
SetUpLexer(Sample)
AddKeyboardShortcut(WindowMain, #PB_Shortcut_Return, 1)

SciSetText(Sample, "Hello World 98745, hello! anyone home? 123, Dave?.. 456is that you Dave?")


Controls = ContainerGadget(#PB_Any, 350, 40, 440, 60)
CheckBoxGadget( 7, 10,  10, 150, 20, "Dot All")
CheckBoxGadget( 8, 10,  30, 150, 20, "Extended")
CheckBoxGadget( 9, 160, 10, 150, 20, "MultiLine")
CheckBoxGadget(10, 160, 30, 150, 20, "Any NewLine")

;ButtonGadget(11, 320, 10, 120, 30, "Sticky Window", #PB_Button_Toggle)
;SetGadgetState(11, #True)
ButtonGadget(12, 315, 20, 110, 30, "Open File", #PB_Button_Default)
CloseGadgetList()

StickyWindow(WindowMain, #True)
SetActiveGadget(StringRegex)

; Regex Test Procedures

Procedure TestIt()
  Static Filter.s, Input$, Regex.i, Matches.i, Flags.i, i.i
  Static Dim Match.s(0)

  SetGadgetText(5, "")
  ;-- Clear existing styling
  ClearAllStyling(Sample)

  Filter = GetGadgetText(StringRegex)
  If (Filter)
    Input$ = SciGetText(Sample)

    Flags = (GetGadgetState( 7) * #PB_RegularExpression_DotAll)
    Flags | (GetGadgetState( 8) * #PB_RegularExpression_Extended)
    Flags | (GetGadgetState( 9) * #PB_RegularExpression_MultiLine)
    Flags | (GetGadgetState(10) * #PB_RegularExpression_AnyNewLine)

    Regex = CreateRegularExpression(#PB_Any, Filter, Flags)
    If (Regex)
      Matches = ExtractRegularExpression(Regex, Input$, Match())
      If (Matches > 0)
        ;-- Number of matches
        SetGadgetText(5, Str(Matches))
        For i = 0 To Matches - 1
          Debug Match(i)
          SetMatchedHighlighted(Sample, Match(i))
        Next i
      Else
        SetGadgetText(5, "none")
      EndIf
      FreeRegularExpression(Regex)
    EndIf
  EndIf
EndProcedure

; Program Loop

TestIt()
Exit = #False

Repeat
  Event = WaitWindowEvent(5)
  Window = EventWindow()
  Type = EventType()
  Menu = EventMenu()
  
  If (Event = #PB_Event_CloseWindow)
    If Window = WindowLibrary
      HideWindow(WindowLibrary, 1)
    Else
      Exit = #True
    EndIf
  ElseIf (Event = #PB_Event_Gadget)
    ID = EventGadget()
    Select ID
      Case StringRegex
        If (Type = #PB_EventType_Change)
          TestIt()
        EndIf
      Case 7, 8, 9, 10
        TestIt()
      Case ButtonSave
        Description$ = InputRequester("Save - add description:", "Enter a description for this Regular Expression.","")
        If Description$ <> ""
          SaveRegex(Description$)
          MessageRequester("Save", "Saved to library.")
        EndIf
      Case ButtonOpen
        OpenWindowLibrary()
      Case ListIconRegex
        If (Type = #PB_EventType_LeftDoubleClick)
          SetGadgetText(StringRegex,GetGadgetItemText(ListIconRegex, GetGadgetState(ListIconRegex), 1))
          TestIt()
        EndIf
      Case 12
        SciLoadFile(Sample)

        TestIt()
    EndSelect
  ElseIf (Event = #PB_Event_Menu)
    Select Menu
      Case 1
        ;--Return
        If GetActiveGadget() = Sample
          ScintillaSendMessage(Sample, #SCI_NEWLINE)
          TestIt()
        EndIf
    EndSelect
  EndIf
Until Exit
End

And the include file (mostly general purpose scintilla procs):

Code: Select all

;-- RegExTesterProcs.pb

;-- Non printing assii character
;-- Debug Chr(31) ;--Unit Seperator, scite says US on a black background
#US$ = Chr(31)
;-- Standard settings of 5 style bits and 3 indicator bits
#SCI_STYLE_MASK = 31

Structure STOREDREGEX
  ID$
  RegEx$
  Description$
EndStructure


Enumeration
  #STYLE_LEXERSTATE_MATCH = 1
EndEnumeration

Global FontSize=12

Procedure SetUpLexer(Gadget)
  ;-- Doesn't really matter, irrelevant for ascii and the gadget will get set
  ;-- for unicode and utf8 later as required
  ScintillaSendMessage(Gadget, #SCI_SETCODEPAGE, #SC_CP_UTF8)
  ScintillaSendMessage(Gadget, #SCI_SETLEXER, #SCLEX_CONTAINER)
  ScintillaSendMessage(Gadget, #SCI_STYLESETSIZE, #STYLE_DEFAULT, FontSize)
  ;-- Not suited to unicode, ignored by scintilla
  CompilerSelect #PB_Compiler_OS
  CompilerCase #PB_OS_Windows
  ScintillaSendMessage(Gadget, #SCI_STYLESETFONT, #STYLE_DEFAULT,  @"Monaco")
  CompilerCase #PB_OS_Linux
  ScintillaSendMessage(Gadget, #SCI_STYLESETFONT, #STYLE_DEFAULT,  @"!Monaco")
  CompilerCase #PB_OS_MacOS
  ScintillaSendMessage(Gadget, #SCI_STYLESETFONT, #STYLE_DEFAULT,  @"Monaco")
  CompilerEndSelect
  ;-- Set column marker
  ScintillaSendMessage(Gadget, #SCI_SETEDGEMODE, 1)
  ScintillaSendMessage(Gadget, #SCI_SETEDGECOLUMN, 85)
  ScintillaSendMessage(Gadget, #SCI_SETEDGECOLOUR, $C0DCC0)
  ;-- Line number margin, width relative to default font size
  ScintillaSendMessage(Gadget, #SCI_SETMARGINWIDTHN, 0, 5 * FontSize)
  ScintillaSendMessage(Gadget, #SCI_SETMARGINWIDTHN, 1, 1)
  ScintillaSendMessage(Gadget, #SCI_SETMARGINWIDTHN, 2, 15)
  ;-- Set default style
  ScintillaSendMessage(Gadget, #SCI_STYLESETFORE, #STYLE_DEFAULT, $000000)
  ScintillaSendMessage(Gadget, #SCI_STYLESETBACK, #STYLE_DEFAULT, $FFFFFF )
  ScintillaSendMessage(Gadget, #SCI_STYLESETBOLD, #STYLE_DEFAULT, 0)
  ;ScintillaSendMessage(Gadget, #SCI_STYLECLEARALL);
  ;-- Set highlight style
  ScintillaSendMessage(Gadget, #SCI_STYLESETBOLD, #STYLE_LEXERSTATE_MATCH, 0)
  ScintillaSendMessage(Gadget, #SCI_STYLESETBACK, #STYLE_LEXERSTATE_MATCH, $00CC00)
EndProcedure

Declare TestIt()

ProcedureDLL ScintillaCallBack(Gadget, *scinotify.SCNotification)
  Global CharCount, CBWord$
  If *scinotify\nmhdr\code = #SCN_CHARADDED
    If *scinotify\ch <> 32
      ;-- Whats Chr() doing? is it unicode compatible?
      ;--CBWord$ + Chr(*scinotify\ch)
    ElseIf *scinotify\ch = 32
      ;--CBWord$ = ""
      TestIt()
      ;EndIf
    EndIf
  ElseIf *scinotify\nmhdr\code = #SCN_MODIFIED
    ;-- Detect modificationType thusly:
    If *scinotify\modificationType & #SC_MOD_INSERTTEXT
      ;Debug "#SC_MOD_INSERTTEXT"
    EndIf
  EndIf
EndProcedure


Procedure GetSciFormat(Gadget)
  Protected Format
  Select ScintillaSendMessage(Gadget, #SCI_GETCODEPAGE)
    Case #SC_CP_UTF8 
      Format = #PB_UTF8
    Case 65001
      Format = #PB_Unicode
    Default
      Format = #PB_Ascii
  EndSelect
  ProcedureReturn Format
EndProcedure
;-- http://en.wikipedia.org/wiki/Byte_order_mark
;-- BOM's seem to be corrupted/missing in some files, use a hex editor to view 
;-- and correct them (insert mode is handy for this)
Procedure GetFileFormat(Gadget, FileHandle)
  Protected Format
  Select ReadStringFormat(FileHandle)
    Case #PB_Unicode
      Format = #PB_Unicode
      ;-- Are these modes the same? see:
      ;-- http://www.scintilla.org/ScintillaDoc.html#SCI_SETCODEPAGE
      ScintillaSendMessage(Gadget, #SCI_SETCODEPAGE, 65001)
      ;ScintillaSendMessage(Gadget, #SCI_SETCODEPAGE, #SC_CP_UTF8)
    Case #PB_UTF8
      Format = #PB_UTF8
      ScintillaSendMessage(Gadget, #SCI_SETCODEPAGE, #SC_CP_UTF8)
    Default
      Format = #PB_Ascii
      ScintillaSendMessage(Gadget, #SCI_SETCODEPAGE, 0)
  EndSelect
  ProcedureReturn Format
EndProcedure

Procedure ClearAllStyling(Gadget)
  ;-- Revert all to default style
  TextLength = ScintillaSendMessage(Gadget, #SCI_GETLENGTH)
  ;-- SCI_STARTSTYLING(int pos, int mask)
  ScintillaSendMessage(Gadget, #SCI_STARTSTYLING, 0, #SCI_STYLE_MASK)
  ScintillaSendMessage(Gadget, #SCI_SETSTYLING, TextLength, #STYLE_DEFAULT);
EndProcedure


;-- Procedure ClearAllText(Gadget)
  ;-- ScintillaSendMessage(Gadget, #SCI_CLEARALL)
;-- EndProcedure


Procedure.s SciGetText(Gadget)
  ;-- Reason for magic + 1: RTFM ScintillaDoc SCI_GETTEXT ;-)
  ;-- Needs to be +2 for unicode, not implemented yet
  ; The number of bytes returned does not include the terminating Null-Character
  ; of the string. The size of the Null-Character is 1 byte for Ascii and
  ; UTF8 mode and 2 bytes for Unicode mode.
  TextLength = ScintillaSendMessage(Gadget, #SCI_GETLENGTH) + 1
  Format = GetSciFormat(Gadget)

  TextBuffer$ = Space(TextLength)
  ScintillaSendMessage(Gadget, #SCI_GETTEXT, TextLength, @TextBuffer$)

  Text$ = PeekS(@TextBuffer$, -1, Format)
  ProcedureReturn Text$
EndProcedure



Procedure SciSetText(Gadget, Text$)
  Format = GetSciFormat(Gadget)
  TextLength = StringByteLength(Text$, Format)
  TextBuffer$ = Space(TextLength)
  PokeS(@TextBuffer$, Text$, -1, Format)
  ScintillaSendMessage(Gadget, #SCI_SETTEXT, 0, @TextBuffer$ )
EndProcedure


Procedure SciLoadFile(Gadget)
  FileName$ = OpenFileRequester("Open a file...", "", "Text (*.txt)|*.txt|All files (*.*)|*.*", 0)
  If FileName$ <> ""
    FileHandle = ReadFile(#PB_Any, FileName$)
    Format = GetFileFormat(Gadget, FileHandle)
    If FileHandle <> 0
      FileLength = Lof(FileHandle)
      *FileBuffer = AllocateMemory(FileLength)
      If *FileBuffer
        ReadData(FileHandle, *FileBuffer, FileLength)

        Text$ = PeekS(*FileBuffer, FileLength, Format)
        SciSetText(Gadget, Text$)

        FreeMemory(*FileBuffer)
        CloseFile(FileHandle)
      EndIf

    EndIf
  EndIf
EndProcedure


Procedure SetMatchedHighlighted(Gadget, Match$)
  ;-- These check out ok, so positions must be wrong
  ;-- Debug "Match: " + Match$
  Format = GetSciFormat(Gadget)
  ;WordLength = Len(Match$)
  WordLength = StringByteLength(Match$, Format)
  Text$ = SciGetText(Gadget)

  Pos = 1
  Repeat
    Pos = FindString(Text$, Match$, Pos)
    If Pos
      SubString$ = Left(Text$, Pos-1)
      TruePos = StringByteLength(SubString$, Format)
      ;Select Format
      ;    Case #PB_Ascii
      ;      ;Debug SubString$
      ;      Debug "#PB_Ascii " + Str(StringByteLength(SubString$, #PB_Ascii))
      ;    Case #PB_UTF8
      ;      Debug "#PB_UTF8 " + Str(StringByteLength(SubString$, #PB_UTF8))
      ;    Case #PB_Unicode
      ;      Debug "#PB_Unicode " + Str(StringByteLength(SubString$, #PB_Unicode))
      ;EndSelect
      ScintillaSendMessage(Gadget, #SCI_STARTSTYLING, TruePos, #SCI_STYLE_MASK)
      ScintillaSendMessage(Gadget, #SCI_SETSTYLING, WordLength, #STYLE_LEXERSTATE_MATCH)
      ScintillaSendMessage(Gadget, #SCI_GOTOPOS, TruePos + WordLength)
      ;-- Pos for next pass
      Pos = Pos + WordLength
    EndIf
  Until Pos = 0
EndProcedure


;---------- Save and Load regex
Procedure LoadRegex()
  Global NewList SavedRegex.STOREDREGEX()
  FileName$ = "SavedRegex.txt"
  ;-- Check for max file size here
  If FileSize(FileName$) > 10485760
    MessageRequester("Information","Maximum file size of 10Mb exceeded.")
  EndIf
  If ReadFile(0, FileName$)
    While Eof(0) = 0
      Line$ = ReadString(0)
      AddElement(SavedRegex())
      SavedRegex()\ID$ = StringField(Line$, 1, #US$)
      SavedRegex()\RegEx$ = StringField(Line$, 2, #US$)
      SavedRegex()\Description$ = StringField(Line$, 3, #US$)
    Wend
    CloseFile(0)
  Else
    MessageRequester("Information","Couldn't open the file!")
  EndIf
  ClearGadgetItems(ListIconRegex)
  SortStructuredList(SavedRegex(), #PB_Sort_Ascending, OffsetOf(STOREDREGEX\Description$), #PB_Sort_String)
  ResetList(SavedRegex())
  While NextElement(SavedRegex())
    AddGadgetItem(ListIconRegex, -1, SavedRegex()\Description$ + Chr(10)+SavedRegex()\RegEx$)
  Wend
EndProcedure


Procedure SaveRegex(Description$)
  FileName$ = "SavedRegex.txt"
  ID$ = Hex(Date() + ElapsedMilliseconds())
  RegEx$ = GetGadgetText(StringRegEx)
  If OpenFile(0, FileName$)
    FileSeek(0, Lof(0))
    WriteStringN(0, ID$ + #US$ + RegEx$ + #US$ + Description$ + #US$)
    CloseFile(0)
  EndIf
  LoadRegex()
EndProcedure


Procedure OpenWindowLibrary()
  If IsWindow(WindowLibrary) <> 0
    HideWindow(WindowLibrary, 0)
    SetActiveWindow(WindowLibrary)
  Else
    FileOpenFlags = #PB_Window_Tool | #PB_Window_SystemMenu
    FileOpenFlags | #PB_Window_WindowCentered
    WindowLibrary = OpenWindow(#PB_Any, 0, 0, 400, 400, "Library", FileOpenFlags, WindowID(WindowMain))
    If WindowLibrary
      ListIconRegex = ListIconGadget(#PB_Any, 0, 20, 400, 380, "Description", 200)
      AddGadgetColumn(ListIconRegex, 1, "Regex", 200)
      LoadRegex()
    EndIf
  EndIf
EndProcedure

Last edited by fiver on Tue Jun 12, 2012 2:07 pm, edited 4 times in total.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Yet another Regular Expression tester ... sorry(!)

Post by IdeasVacuum »

Nice one! 8)
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Yet another Regular Expression tester ... sorry(!)

Post by IdeasVacuum »

Problem: PB4.61 WinXP 32bit

On start-up:
RegExTesterProcs.pb Line 104 [ERROR] Invalid memory access (read error at address 0)

In the procedure, isn't TextLength always going to be > zero?

Code: Select all

Procedure.s GetAllText(Gadget)
  TextLength = ScintillaSendMessage(Gadget, #SCI_GETLENGTH) + 1
  If TextLength > 0
    *TextBuffer=AllocateMemory(TextLength)
    ScintillaSendMessage(Gadget, #SCI_GETTEXT, TextLength, *TextBuffer)
    Text$ = PeekS(*TextBuffer)
    FreeMemory(*TextBuffer)
  EndIf
  ProcedureReturn Text$
EndProcedure
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
User avatar
fiver
User
User
Posts: 36
Joined: Wed May 05, 2004 8:21 pm
Location: An outer spiral arm of the Milky Way

Re: Yet another Regular Expression tester ... sorry(!)

Post by fiver »

@IdeasVacuum
Sorry about that :oops:
Fixed now see code edits above, tested on XP W32.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Yet another Regular Expression tester ... sorry(!)

Post by IdeasVacuum »

whoops.....invalid memory access on start up: RegExTesterProcs.pb Procedure GetAllText(): *TextBuffer=AllocateMemory(TextLength + 1) :|
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
User avatar
fiver
User
User
Posts: 36
Joined: Wed May 05, 2004 8:21 pm
Location: An outer spiral arm of the Milky Way

Re: Yet another Regular Expression tester ... sorry(!)

Post by fiver »

IdeasVacuum wrote:whoops.....invalid memory access on start up: RegExTesterProcs.pb Procedure GetAllText(): *TextBuffer=AllocateMemory(TextLength + 1) :|
:?

Strange, as that stands surely it's just a memory allocation so in itself can't be causing the problem, are you using the code above unedited?

(shuffles off to test ...)

I just copied and pasted it into PB on XP and on a Vista virtual machine and it seemed to work fine on both, so I don't know what to suggest.
User avatar
Danilo
Addict
Addict
Posts: 3036
Joined: Sat Apr 26, 2003 8:26 am
Location: Planet Earth

Re: Yet another Regular Expression tester ... sorry(!)

Post by Danilo »

fiver wrote:I just copied and pasted it into PB on XP and on a Vista virtual machine and it seemed to work fine on both, so I don't know what to suggest.
Suggest to disable UNICODE. ;)

Requires some AllocateMemory( TextLength * SizeOf(Character) ) and more to work correctly with Unicode.
User avatar
fiver
User
User
Posts: 36
Joined: Wed May 05, 2004 8:21 pm
Location: An outer spiral arm of the Milky Way

Re: Yet another Regular Expression tester ... sorry(!)

Post by fiver »

Danilo wrote:
fiver wrote:I just copied and pasted it into PB on XP and on a Vista virtual machine and it seemed to work fine on both, so I don't know what to suggest.
Suggest to disable UNICODE. ;)

Requires some AllocateMemory( TextLength * SizeOf(Character) ) and more to work correctly with Unicode.
Thanks Danilo, note added. I should also thank you for your nifty use of repeat and stringfield that I purloined from your posts on PB IDE Smart Highlight
:)
User avatar
fiver
User
User
Posts: 36
Joined: Wed May 05, 2004 8:21 pm
Location: An outer spiral arm of the Milky Way

Re: Yet another Regular Expression tester [adapted for Unico

Post by fiver »

Altered to support unicode and removed the second superfluous scintilla gadget, code above edited accordingly.
IdeasVacuum
Always Here
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK
Contact:

Re: Yet another Regular Expression tester [adapted for Unico

Post by IdeasVacuum »

.....ahhh. If I had a £ for everytime I forgot the Unicode setting I'd be very rich by now! :oops:
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
Amundo
Enthusiast
Enthusiast
Posts: 200
Joined: Thu Feb 16, 2006 1:41 am
Location: New Zealand

Re: Yet another Regular Expression tester [adapted for Unico

Post by Amundo »

Whoa there crazy horse, thanks fiver, very helpful, thanks for sharing!
Win10, PB6.x, okayish CPU, onboard video card, fuzzy monitor (or is that my eyesight?)
"When the facts change, I change my mind" - John Maynard Keynes
Post Reply