
How to recognize alphaMeric characters

Posted: Wed Sep 27, 2023 5:11 pm
by vmars316
TIA ,
I am writing a program that will Read a .txt file ,
collect and store each word into a List ;
then Sort the List
(Here , I plan to TRIM the List , once I learn how to recognize alphaMeric characters)
then Write the Sorted List to a .txt file .

Now I want to LTrim and RTrim anything that is NOT AlphaMeric ,
that is , anything other than A-Z , a-z and 0-9 ,
and treat common symbols such as @, #, *, and & as NOT AlphaMeric .
Below is my code
(please excuse my simpleton code , I write code this way (step by step)
so that years from now I will be able to understand each step) .

So , my question is:
How can I recognize Leading and Trailing alphaMeric characters (as defined above) .

Code: Select all

; Author: Vern Marsden vmars316
; http://www.purebasic.fr/english/
; https://www.purebasic.com/documentation/
; CRLF = Chr(13)+Chr(10)
; 
; ReadFile_Words_to_List.pb

EnableExplicit 
Global FileName$="ReadFile_Words_to_List.txt" 
Global NewList WordsList.s() , WordsListCOUNT = 0
; The first list element is at position 0, the next at 1 and so on. 
; ============================================================
If ReadFile(0, FileName$) 
  Global aLine$ , aLineLEN = 0 , BlankPOS = 0 , StartPOS= 1 , LastPOD = 0 ,
         StartMid = 0 , MidWord$ , 
         LastWord$ , LastWordLEN = 0  
  ClearList(WordsList()) 
  Repeat ; for each line
    BlankPOS = 0 : StartPOS= 1 
    aLine$ = ReadString(0)   ;   Read Next Line
;    Debug aline$   
    aLineLEN = Len(aLine$)
   
    ; parse/collect words 
    Repeat ; for each word 
      BlankPOS = FindString(aLine$, " " , StartPOS , #PB_String_NoCase)
      If BlankPOS > 0 
        MidWord$ = Mid(aLine$, StartPOS , BlankPOS - StartPOS )   ;   do not include the Blank itself
        
        AddElement(WordsList())
        WordsList() = MidWord$
        
        WordsListCOUNT = WordsListCOUNT + 1
        StartPOS = BlankPOS + 1
        
      EndIf 
    Until BlankPOS < 1   ;   
; ============================================================
    If BlankPOS < 1  ;  There are no more Blanks/Spaces 
      LastWordLEN = aLineLEN - StartPOS + 1
      LastWord$ = Mid(aLine$, StartPOS , LastWordLEN)

        AddElement(WordsList())
        WordsList() = LastWord$
      
      EndIf 
;    Debug aline$   
  Until Eof(0) ;  eol 
  CloseFile(0) 
; ============================================================
  SortList(WordsList(),#PB_Sort_Ascending  | #PB_Sort_NoCase)
  MessageRequester("Information", "There are "+Str(ListSize(WordsList()))+" elements in the list", #PB_MessageRequester_Ok)
  
  ; ============================================================
; Trim code goes here .  
; ============================================================
  
  CreateFile(2, "Words_List_OUT.txt", #PB_UTF8)  

  ForEach WordsList()
    Debug WordsList()
      WriteStringN(2, WordsList() , #PB_UTF8)  ; "Words_List_OUT.txt"
  Next
  CloseFile(2) 
EndIf
; ============================================================
End 
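
As a hedged aside on the word-splitting step in the code above: the same split could probably be sketched with the built-in CountString() and StringField() functions instead of tracking positions by hand. The procedure name AddWordsFromLine and the sample sentence are made up for illustration; empty fields (produced by repeated spaces) are skipped.

Code: Select all

EnableExplicit

; Hypothetical helper (not from the post above): appends every non-empty,
; space-separated field of lineText to the list Words().
Procedure AddWordsFromLine(lineText.s, List Words.s())
  Protected i, token.s
  Protected fieldCount = CountString(lineText, " ") + 1
  For i = 1 To fieldCount
    token = StringField(lineText, i, " ")
    If token <> ""          ; consecutive spaces give empty fields , skip them
      AddElement(Words())
      Words() = token
    EndIf
  Next
EndProcedure

NewList WordsList.s()
AddWordsFromLine("collect  and store each word", WordsList())
ForEach WordsList()
  Debug WordsList()
Next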

Re: How to recognize alphaMeric characters

Posted: Wed Sep 27, 2023 6:16 pm
by Mr.L
Hello,
maybe a Regular Expression can help?
greetings

Code: Select all

CreateRegularExpression(0, "\S+")

aLine$ = "   1234   #5678 %10101 $01ac"

ExamineRegularExpression(0, aLine$)
While NextRegularExpressionMatch(0)
	Debug RegularExpressionMatchString(0)
Wend
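
If the goal is also to drop punctuation at the start and end of each token (as the original question asks), one possible variation is a pattern that forces every match to begin and end on A-Z, a-z or 0-9. This is only a sketch: the pattern and the sample line are assumptions, not something tested against the real input file. Punctuation inside a token (such as the colon in 12:10) is kept.

Code: Select all

; Assumed pattern: one alphanumeric character, optionally followed by
; any non-space characters as long as the match ends on an alphanumeric one.
If CreateRegularExpression(0, "[A-Za-z0-9](?:\S*[A-Za-z0-9])?")
  Define aLine$ = "(Rom. 12:10). obedience, (which ?"
  If ExamineRegularExpression(0, aLine$)
    While NextRegularExpressionMatch(0)
      Debug RegularExpressionMatchString(0)   ; Rom / 12:10 / obedience / which
    Wend
  EndIf
  FreeRegularExpression(0)
EndIf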

Re: How to recognize alphaMeric characters

Posted: Wed Sep 27, 2023 7:05 pm
by Olli
Regular expressions are a good choice; I won't stop anybody from using them.

But I prefer the manual way.

My method:
1) consider a file
2) load the file directly (as quickly as possible)
2a) in a single block if memory allows
2b) in several chunks if it is a big file
3) first filter: the comments

Anything that is purely useless, such as comments, must be ignored.
Several kinds of comment markup should be considered:
3a) from a one-character starting markup to the end of line
3b) from a combo-characters starting markup to the EOL
3c) from a decorated alphabetic markup to the EOL
(here, problem of upper/lower case)
3d) from a 1-character starting markup
to a same 1-character ending markup (flip-flop)
3e) from a 1-character starting markup
to a different 1-character ending markup
3f) from a combo-characters starting markup
to a same combo-characters ending markup
3g) from a combo-characters starting markup
to a different combo-characters ending markup
3h) from a decorated alphabetic starting markup
to a same decorated alphabetic ending markup
3i) from a decorated alphabetic starting markup
to a different decorated alphabetic ending markup

4) second filter: the underlying data

Underlying data is data that could be useful in general, but is not useful for your particular task. This data must be ignored, but not excluded from a miscellaneous count (if a native English speaker understands that last expression, let's light a candle and weep with joy...).

5) now, the main system: a stats robot. It reads the file (loaded in memory) character by character, and enables or disables several variables, each of which takes one of these values:
5a) It is unknown whether Mode X is enabled or disabled
5b) Mode X is disabled
5c) Mode X is enabled
5d) Mode X is disabled and in a conditional intermediate state, about to be enabled
5e) Mode X is enabled and in a conditional intermediate state, about to be disabled

5f) These states of a variable could be expressed as << yes, no, maybe >> for simple states, << yes, yes if condition, no, no if condition, maybe >> for complex states, and << yes, yes if condition A, yes if condition B, no, no if condition C, no if condition D, maybe >> for more complex states.

5g) Note that a "condition" is (simply) the status of another variable.

6) The real process

6a) Read a character and convert it to its character code c.
6b) Is it a system (control) character? c < 32
6c) Is it a 7-bit ASCII character? c < 128
6d) Is it an 8-bit ASCII character? c < 256
6e) etc. (the Unicode table has all the other ranges: Chinese, Japanese, Cyrillic, etc.)

6f) So, in any case, your first stat variables will be these:

Code: Select all

Define.i systemCharacter
Define.i ascii7bitsCharacter
Define.i ascii8bitsCharacter

6g) Note that an ascii7bitsCharacter is not a systemCharacter.

7) After the first level of tests, apply the same mechanism at a second level. Example: ascii7bitsCharacter is #True

7a) Is it a decimal digit?
7b) Is it a lowercase letter?
7c) Is it an uppercase letter?

Do you want some example code?
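
For readers who want to try the manual route, here is a minimal sketch of the second-level test in step 7, applied to the trimming asked about in the first post. It is an illustration under assumptions, not the author's code: the procedure names IsAlphaNumeric and TrimNonAlphaNumeric are hypothetical, and only A-Z, a-z and 0-9 count as alphanumeric.

Code: Select all

EnableExplicit

; Hypothetical helper: #True if character code c is a digit or an ASCII letter.
Procedure.i IsAlphaNumeric(c.i)
  Select c
    Case '0' To '9', 'A' To 'Z', 'a' To 'z'
      ProcedureReturn #True
  EndSelect
  ProcedureReturn #False
EndProcedure

; Hypothetical helper: removes leading and trailing characters that are
; not alphanumeric; characters in the middle of the word are left alone.
Procedure.s TrimNonAlphaNumeric(word.s)
  Protected first = 1, last = Len(word)
  While first <= last And IsAlphaNumeric(Asc(Mid(word, first, 1))) = #False
    first + 1
  Wend
  While last >= first And IsAlphaNumeric(Asc(Mid(word, last, 1))) = #False
    last - 1
  Wend
  If first > last
    ProcedureReturn ""      ; nothing alphanumeric in the word
  EndIf
  ProcedureReturn Mid(word, first, last - first + 1)
EndProcedure

Debug TrimNonAlphaNumeric("(Rom.")      ; Rom
Debug TrimNonAlphaNumeric("12:10).")    ; 12:10
Debug TrimNonAlphaNumeric("?")          ; empty string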

Re: How to recognize alphaMeric characters

Posted: Wed Sep 27, 2023 8:52 pm
by SMaag
We had a discussion a short time ago about how to remove non-word characters.
Maybe it would be better to remove non-word characters before counting words.

Here is the link to the discussion:

viewtopic.php?p=604821&hilit=SMaag#p604821


Posted: Wed Sep 27, 2023 10:32 pm
by vmars316
Wow , Great response People , Thank You :
I've got to study this a while ; I'll be back .

Re: How to recognize alphaMeric characters

Posted: Thu Sep 28, 2023 4:31 am
by Olli
vmars316 wrote:
So , my question is:
How can I recognize Leading and Trailing alphaMeric characters (as defined above) .

Use prime numbers.

Re: How to recognize alphaMeric characters

Posted: Thu Sep 28, 2023 5:38 pm
by vmars316
MrL , Awesome ,
Thanks for the regex code .
It works great except for a few things:
(Rom.
12:10).
6:5–6,
“obedience,”
(which
?
How can this be fixed ?
I rarely use regex , so I am not good at it .
Or should I start a new Post for 'regex help' ?
Thanks

Re: How to recognize alphaMeric characters

Posted: Thu Sep 28, 2023 9:48 pm
by Mr.L
Hi, vmars316!
can you explain how it should look?

Re: How to recognize alphaMeric characters

Posted: Thu Sep 28, 2023 10:09 pm
by vmars316
Sure .
Output should look like:

Code: Select all

a
abandon
abilities
able
about
above
above,
abundant
abuse
acceptable
acceptable
acceptable
according
according
account
achieve
acknowledge
act
act
acting
action
actions.
actions.
admire
admiring
admonishes

But the output currently looks like:

Code: Select all

1
1
12:10).
12:1–2
13:16).
16).
16:24
16:2–3).
1:17
2
2
21:31–32
23:19
2:11–14).
2:11–14,
2:20).
2:3
4:10
4:23).
4:7).
6:10
6:5–6,
?
?
a
abandon
abilities


Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 2:07 am
by vmars316
Mr.L ;
I don't know if this Helps
but below is my code for that section ,
and here is the input file (quite large):
https://vmars.us/ShowMe/Little-book-on ... -Life.txt

Code: Select all

; ============================================================
; Trim code goes here .  
; ===============================================================
  Global TrimLetter$ , TriMMedWord$ = "" , TrimLetterNumeric = 0 , 
         aLineOUT$ = ""
  FileName$ = "Words_List_OUT.txt"
If ReadFile(0, FileName$) 
  ClearList(WordsList()) 
  Debug "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"
  Repeat ; for each Word
    aLine$ = ReadString(0)   ;   Read Next Line
;    Debug aline$   
    aLineLEN = Len(aLine$)
   
; ============================================================
    CreateRegularExpression(0, "\S+")
    ExamineRegularExpression(0, aLine$)
    While NextRegularExpressionMatch(0)
      aLineOUT$ = RegularExpressionMatchString(0)
      Debug aLineOUT$
      ; store every match , not only the last one found on the line
      AddElement(WordsList())
      WordsList() = aLineOUT$
    Wend
; ============================================================ 
 ;    Debug aline$   
  Until Eof(0) ;  eol 

  CloseFile(0) 

; ============================================================
  
  CreateFile(2, "Words_List_OUT_AGAIN.txt", #PB_UTF8)  

  ForEach WordsList()
    Debug WordsList()
      WriteStringN(2, WordsList() , #PB_UTF8)  ; "Words_List_OUT.txt"
  Next
EndIf
; ============================================================
; CreateRegularExpression(0, "\S+")
; 
; aLine$ = "   1234   #5678 %10101 $01ac"
; 
; ExamineRegularExpression(0, aLine$)
; While NextRegularExpressionMatch(0)
; 	Debug RegularExpressionMatchString(0)
; Wend
; ============================================================

EndIf 
End 

Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 6:52 am
by Mr.L
...maybe a combination of RegEx and "manual" processing is a better solution.

Code: Select all

NewList WordsList.s()

; ============================================================
; Trim code goes here .  
; ===============================================================

; a list of characters, that are allowed at the beginning of a word
Global AllowedCharactersBegin$ = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
; a list of characters, that are allowed at the end of a word
Global AllowedCharactersEnd$ = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

CreateRegularExpression(0, "\S+")

FileName$ = "C:\Users\User\Downloads\Little-book-on-the-Christian-Life.txt"
If ReadFile(0, FileName$) 
	ClearList(WordsList()) 
	
	Repeat ; for each Word
		aLine$ = ReadString(0)   ;   Read Next Line
		; Debug aLine$
		
		; ============================================================
		If ExamineRegularExpression(0, aLine$)
			While NextRegularExpressionMatch(0)
				word.s = RegularExpressionMatchString(0)
				; ...further processing
				
				; trim not allowed characters at the beginning of the word
				While word <> "" And FindString(AllowedCharactersBegin$, Left(word, 1)) = 0
					word = Mid(word, 2)
				Wend
				
				; trim not allowed characters at the end of the word
				While word <> "" And FindString(AllowedCharactersEnd$, Right(word, 1)) = 0
					word = Mid(word, 1, Len(word) - 1)
				Wend
				
				If word <> ""
					AddElement(WordsList())
					WordsList() = word
				EndIf
			Wend
		EndIf	
		
		; ============================================================ 
	Until Eof(0) ;  eol 
	
	CloseFile(0) 
	
	; ============================================================
	
 	CreateFile(2, "Words_List_OUT_AGAIN.txt", #PB_UTF8)
	SortList(WordsList(), #PB_Sort_Ascending)
	ForEach WordsList()
		WriteStringN(2, WordsList())  ; "Words_List_OUT.txt"
	Next
EndIf
End 

Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 4:35 pm
by Marc56us
Hi,
I suggest the following (based on one of my previous RegEx). I only added an exclusion for digits.
No need for Trim: the RegEx does the whole job.

Code: Select all

NewList WordsList.s()

CreateRegularExpression(0, "[^ ,;.!?()*:\t\r\n\d]+")

FileName$ = "Little-book-on-the-Christian-Life.txt"
If ReadFile(0, FileName$) 
    ClearList(WordsList()) 
    
    Repeat ; for each Word
        aLine$ = ReadString(0)   
        ExamineRegularExpression(0, aLine$)
        While NextRegularExpressionMatch(0)
            word.s = RegularExpressionMatchString(0)
            AddElement(WordsList())
            WordsList() = word  
        Wend
    Until Eof(0)  
    CloseFile(0)   
    
    CreateFile(2, "Words_List_OUT_AGAIN.txt", #PB_UTF8)
    SortList(WordsList(), #PB_Sort_Ascending)
    ForEach WordsList()
        If WordsList()
            WriteStringN(2, WordsList())  ; "Words_List_OUT.txt"
        EndIf
    Next
EndIf
End 

Curiously enough, my RegEx seems to exclude apostrophes even though I didn't ask it to. I'll have a look later.

Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 7:10 pm
by AZJIO
You can use this method
SplitL2 viewtopic.php?p=585485#p585485
IsLatin viewtopic.php?p=583353#p583353

Code: Select all

; AZJIO
; https://www.purebasic.fr/english/viewtopic.php?p=608408#p608408
Procedure SplitListByWords(*c.Character, List StringList.s())
	Protected *S.Integer
	*S = *c
	
	ClearList(StringList())
	
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	
	While *c\c
		
		If Not ((*c\c >= 'a' And *c\c <= 'z') Or (*c\c >= 'A' And *c\c <= 'Z'))
			*c\c = 0
			If *S <> *c
				AddElement(StringList())
				StringList() = PeekS(*S)
				; StringList() = PeekS(*S, (*c - *S) >> 1)
			EndIf
			*S = *c + SizeOf(Character)
		EndIf
		
		*c + SizeOf(Character)
	Wend
	*c - SizeOf(Character)
	If ((*c\c >= 'a' And *c\c <= 'z') Or (*c\c >= 'A' And *c\c <= 'Z'))
		AddElement(StringList())
		StringList() = PeekS(*S)
	EndIf
EndProcedure

Define S.s = "This is    a test              string               ,#@%^              to                .?<>\             see                if    split are    working."
; Define S.s = " This "
Define NewList MyStrings.s()

; the variable will be modified after the function call
SplitListByWords(@S, MyStrings())

ForEach MyStrings()
    Debug "|" + MyStrings() + "|"
Next

Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 10:18 pm
by vmars316
Mr.L ,
Awesome , Thank You Very Much...

I like your AllowedCharactersBegin$
and AllowedCharactersEnd$

This program will eventually end up as a Procedure ('Create Index' , Button)
in a larger Freeware utility program called "Line-By-Line" .
Here is a SnapShot of that program so far .
https://vmars.us/ShowMe/Line-By-Line-ScreenShot.png

All the questions I have asked over the last year are for this 'Line-By-Line' program .
Reminds me of the Pete Seeger song "Inch by Inch , Row by Row" .
So , Thanks to All.

Next step: write Procedure to 'RemoveDuplicateLines' .
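
A rough sketch of that next step, assuming the list has already been sorted so that duplicates sit next to each other. The procedure name RemoveAdjacentDuplicates and the demo list are hypothetical, and the comparison is case-sensitive; a case-insensitive version would need to compare LCase() values instead.

Code: Select all

EnableExplicit

; Hypothetical helper: removes repeated neighbours from an already sorted list.
Procedure RemoveAdjacentDuplicates(List Words.s())
  Protected previous.s, isFirst = #True
  ForEach Words()
    If isFirst = #False And Words() = previous
      DeleteElement(Words())   ; the element before it becomes the current one
    Else
      previous = Words()
      isFirst = #False
    EndIf
  Next
EndProcedure

NewList Demo.s()
AddElement(Demo()) : Demo() = "act"
AddElement(Demo()) : Demo() = "act"
AddElement(Demo()) : Demo() = "acting"
RemoveAdjacentDuplicates(Demo())
ForEach Demo()
  Debug Demo()   ; act , then acting
Next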

Re: How to recognize alphaMeric characters

Posted: Fri Sep 29, 2023 10:22 pm
by AZJIO
vmars316
You can also check the speed of functions and the increase in program size when inserting code. For example, regular expressions add 150 KB to the program.
Removing duplicates should be another topic and has already been discussed many times.

Code: Select all

; AZJIO
; https://www.purebasic.fr/english/viewtopic.php?p=608430#p608430
Procedure SplitMapByWords(*c.Character, Map StringMap.s(), CaseSensitive = 1)
	Protected *S.Character
	*S = *c
	ClearMap(StringMap())
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	While *c\c
		If Not ((*c\c >= 'a' And *c\c <= 'z') Or (*c\c >= 'A' And *c\c <= 'Z'))
			*c\c = 0
			If *S <> *c
				If CaseSensitive
					AddMapElement(StringMap(), PeekS(*S))
				Else
					AddMapElement(StringMap(), LCase(PeekS(*S)))
					StringMap() = PeekS(*S)
				EndIf
			EndIf
			*S = *c + SizeOf(Character)
		EndIf
		*c + SizeOf(Character)
	Wend
	*c - SizeOf(Character)
	If ((*c\c >= 'a' And *c\c <= 'z') Or (*c\c >= 'A' And *c\c <= 'Z'))
		If CaseSensitive
			AddMapElement(StringMap(), PeekS(*S))
		Else
			AddMapElement(StringMap(), LCase(PeekS(*S)))
			StringMap() = PeekS(*S)
		EndIf
	EndIf
EndProcedure

Define S.s = "This is see see see is see see see is this"
Define NewMap MyStrings.s()

; the variable will be modified after the function call
SplitMapByWords(@S, MyStrings(), 0)

ForEach MyStrings()
	Debug "|" + MapKey(MyStrings()) + "|"
Next