Stemming (?)

Posted: Thu Feb 29, 2024 4:42 pm
by AZJIO
Has anyone implemented this algorithm?
With regular expressions this is easy to do, but I would like to do it without them.
Here is the algorithm for the Russian language.
Finding vowels is easy: [аеиоуыэюя]
Converting a regular expression like ([ая])(в|вши|вшись)$ is more difficult. I need to move the pointer to the end of the string, check backwards from the end for a match of one of the tokens, and then check that the letter before the token is one of the two allowed letters.
Here's for English: link1

I adapted the algorithm for AkelPad to create an auto-completion list in a help file without resorting to online resources. Now I would like to use it in TextCorrection for replacing abbreviations and calques.

Code: Select all

Define length, start$, rv$
Define *c.Character, *g.Character
Define *g0, *c0
Define Text$ = "берётся"

Define RVRE$ = "аеиоуыэюя"
; Define PERFECTIVEGROUND_1 = ""

If FindString(Text$, "ё", 1, #PB_String_NoCase)
	OrigText$ = Text$
	ReplaceString(Text$, "ё", "е", #PB_String_NoCase | #PB_String_InPlace)
EndIf

*c = @Text$
*c0 = *c
length = Len(Text$)
; *c + (length - 1) * SizeOf(Character)


*g = @RVRE$
*g0 = *g

While *c\c
	*g = *g0
	While *g\c
		If *c\c = *g\c
			pos = *c - *c0 + 1
; 			Debug Chr(*c\c) + Chr(*g\c)
; 			Debug pos
			start$ = Mid(Text$, 1, pos - 1)
			rv$ = Mid(Text$, pos)
			Break 2
		EndIf
		*g + SizeOf(Character)
	Wend
	*c + SizeOf(Character)
Wend

Debug start$
Debug rv$
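
The check described above for ([ая])(в|вши|вшись)$ can be sketched without regular expressions by comparing the end of the word against each token and then looking at the letter just before it. This is only a minimal sketch: the procedure name StripPerfectiveGerund is hypothetical, and it assumes that the token is removed while the preceding а/я is kept. It also ignores the start$/rv$ split from the code above.

```purebasic
; Hypothetical helper that emulates ([ая])(в|вши|вшись)$ without a regular expression.
; Assumption: the token is removed and the preceding letter is kept.
Procedure.s StripPerfectiveGerund(Word$)
  Protected i, token$, prev$
  Protected Dim tokens$(2)
  ; Longest token first is a safe default when tokens could overlap
  tokens$(0) = "вшись"
  tokens$(1) = "вши"
  tokens$(2) = "в"
  For i = 0 To 2
    token$ = tokens$(i)
    ; The word must be longer than the token, so that a letter can precede it
    If Len(Word$) > Len(token$) And Right(Word$, Len(token$)) = token$
      ; The letter immediately before the token
      prev$ = Mid(Word$, Len(Word$) - Len(token$), 1)
      If prev$ = "а" Or prev$ = "я"
        ProcedureReturn Left(Word$, Len(Word$) - Len(token$))
      EndIf
    EndIf
  Next
  ProcedureReturn Word$ ; nothing matched, return the word unchanged
EndProcedure

Debug StripPerfectiveGerund("прочитав") ; прочита
```

Using Right(), Left() and Mid() keeps all positions in characters, so no pointer arithmetic is needed for this step.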

Re: Stemming (?)

Posted: Thu Feb 29, 2024 6:48 pm
by SMaag
I don't exactly understand your goal, but for wide (Unicode) strings the pos calculation seems to be wrong: the pointer difference is a byte count, not a character count.
I adapted the code a little.

Code: Select all

Define length, start$, rv$
Define *c.Character, *g.Character
Define *g0, *c0
Define Text$ = "FFTH is a test"

Define RVRE$ = "aeiou"
; Define PERFECTIVEGROUND_1 = ""

If FindString(Text$, "ё", 1, #PB_String_NoCase)
	OrigText$ = Text$
	ReplaceString(Text$, "ё", "е", #PB_String_NoCase | #PB_String_InPlace)
EndIf

*c = @Text$
*c0 = *c
length = Len(Text$)
; *c + (length - 1) * SizeOf(Character)


*g = @RVRE$
*g0 = *g

While *c\c
	*g = *g0
	While *g\c
		If *c\c = *g\c
			; The pointer difference is in bytes; divide by the character size to get a 1-based character position
			pos = (*c - *c0) / SizeOf(Character) + 1
			start$ = Left(Text$, pos - 1)
			rv$ = Mid(Text$, pos)
			Break 2
		EndIf
		*g + SizeOf(Character)
	Wend
	*c + SizeOf(Character)
Wend

Debug start$
Debug rv$

It splits the string at the first character that is found in RVRE$.

But I guess you want to exchange a list of characters! Is this correct?

Re: Stemming (?)

Posted: Thu Feb 29, 2024 8:42 pm
by idle
If you want to strip accents, this module does that:
https://github.com/idle-PB/UTF16/blob/main/UTF16a.pb

Re: Stemming (?)

Posted: Thu Feb 29, 2024 9:20 pm
by AZJIO
SMaag wrote: Thu Feb 29, 2024 6:48 pm
I don't exactly understand your goal
addition
added
adding
These words become "add"
SMaag wrote: Thu Feb 29, 2024 6:48 pm
But I guess you want to exchange a list of characters! Is this correct?
I think that the first action creates the left side of the word, which will not disappear under any circumstances. The right side of the word is subject to change, and after this change at least one letter must remain. When you combine the left and right sides you get the root of the word. This protects words from processing that would leave an empty string or a single letter. I may be wrong, but I am about 99% sure of this guess.
SMaag wrote: Thu Feb 29, 2024 6:48 pm

Code: Select all

Define Text$ = "FFTH is a test"
There can only be one word here, for example "added". After processing you will receive the word "add".

View online resource (http)
loving
loved
=love
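
My guess about the protected left side and the changeable right side can be sketched like this. It is only a sketch under my own assumptions: the names FirstVowelPos and RemoveSuffixInTail are hypothetical, the right side starts at the first vowel as in the code from the first post, and "at least one letter must remain" is read as "the right side may not be removed completely".

```purebasic
; Hypothetical helper: 1-based position of the first vowel, 0 if there is none
Procedure FirstVowelPos(Word$, Vowels$)
  Protected i
  For i = 1 To Len(Word$)
    If FindString(Vowels$, Mid(Word$, i, 1))
      ProcedureReturn i
    EndIf
  Next
  ProcedureReturn 0
EndProcedure

; Hypothetical helper: removes Suffix$ only if it lies inside the changeable
; right side and at least one letter of that right side remains afterwards
Procedure.s RemoveSuffixInTail(Word$, Suffix$, Vowels$)
  Protected pos = FirstVowelPos(Word$, Vowels$)
  Protected tail$
  If pos = 0
    ProcedureReturn Word$ ; no vowel, so there is nothing that may change
  EndIf
  tail$ = Mid(Word$, pos) ; the right side, starting at the first vowel
  If Len(tail$) > Len(Suffix$) And Right(tail$, Len(Suffix$)) = Suffix$
    ProcedureReturn Left(Word$, Len(Word$) - Len(Suffix$))
  EndIf
  ProcedureReturn Word$
EndProcedure

Debug RemoveSuffixInTail("added", "ed", "aeiou") ; add
Debug RemoveSuffixInTail("ed", "ed", "aeiou")    ; ed (unchanged, nothing would remain)
```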

Re: Stemming (?)

Posted: Fri Mar 01, 2024 6:11 am
by idle
You can do that with squint3: load a dictionary and enumerate it from "add",
so it would print "addition", "added", "adding" ...

You could also do it with an end marker, which lets you control the enumeration depth by passing in the marker to halt at.
So you insert "add\". Then, when you go to insert "added", you look up "ad", then "ad\", then "add" and "add\". A root is found, so you insert only "ed". The same goes for "ing", since you found the root word "add\".
The trie then contains "add\ed" and "add\ing".
If you enumerate "ad" it will halt at "add\", and then you can enumerate again from there to get the suffixes "ed" and "ing".
The cost isn't much, as you do it using the node pointer from the lookup, so it is like the cost of moving to the next element of a list.
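
To illustrate just the end-marker lookup order, here is a rough sketch that uses a plain PureBasic Map as a stand-in for the trie. It does not use squint3 or its API, and the name SplitAtRoot is hypothetical. A Map costs a full lookup per prefix, which a trie avoids by continuing from the node pointer.

```purebasic
; Roots are stored with the end marker "\" appended, e.g. "add\"
NewMap Roots.i()

; Hypothetical helper: looks up each prefix of Word$ with the marker appended and,
; when a root is found, returns root + "\" + suffix. Otherwise returns Word$ unchanged.
Procedure.s SplitAtRoot(Map Roots.i(), Word$)
  Protected n
  ; Shortest prefix first: "a\", "ad\", "add\", ...
  For n = 1 To Len(Word$) - 1
    If FindMapElement(Roots(), Left(Word$, n) + "\")
      ProcedureReturn Left(Word$, n) + "\" + Mid(Word$, n + 1)
    EndIf
  Next
  ProcedureReturn Word$
EndProcedure

Roots("add\") = #True

Debug SplitAtRoot(Roots(), "added")  ; add\ed
Debug SplitAtRoot(Roots(), "adding") ; add\ing
Debug SplitAtRoot(Roots(), "love")   ; love (no root stored)
```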

Re: Stemming (?)

Posted: Fri Mar 01, 2024 7:21 am
by AZJIO
As in many languages, words have different endings depending on tense, gender, and number (singular or plural). For simplicity, everyone uses slang. For example, I write the word Windows as “винда”. The ending of this word may change, for example:
винда
винде
винду
виндам
When I press the hotkey, this word changes to Windows. That is, I write the easy way and then turn the text into its correct form, because slang is not welcome on some sites: readers do not understand what is being talked about.
To make a replacement table I have to write the following:
винда=Windows
винде=Windows
винду=Windows
виндам=Windows
автоит=AutoIt3
автоиту=AutoIt3
автоита=AutoIt3
автоите=AutoIt3
пурик=PureBasic
пурику=PureBasic
пурика=PureBasic
пурике=PureBasic
акел=AkelPad
But I can do it differently:
винд=Windows
автоит=AutoIt3
пурик=PureBasic
I write a single line per word. When the text analyzer processes a word, it returns the stem "винд" without the ending. I can freely apply any ending to the word, and it will always be corrected.
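
Roughly, the lookup could work like the sketch below. The names NaiveStem and Correct are hypothetical, and NaiveStem is only a placeholder that cuts off a few endings taken from the examples above. It is not the real stemming algorithm, and it assumes lower-case input.

```purebasic
; Replacement table keyed by the stem
NewMap Table.s()
Table("винд") = "Windows"
Table("автоит") = "AutoIt3"
Table("пурик") = "PureBasic"

; Hypothetical placeholder for a real stemmer: removes one of a few endings
Procedure.s NaiveStem(Word$)
  Protected i, ending$
  For i = 1 To 4
    ending$ = StringField("ам,а,е,у", i, ",")
    ; Keep at least one letter in front of the ending
    If Len(Word$) > Len(ending$) And Right(Word$, Len(ending$)) = ending$
      ProcedureReturn Left(Word$, Len(Word$) - Len(ending$))
    EndIf
  Next
  ProcedureReturn Word$
EndProcedure

; Returns the replacement for the stem of Word$, or Word$ itself if the stem is unknown
Procedure.s Correct(Map Table.s(), Word$)
  If FindMapElement(Table(), NaiveStem(Word$))
    ProcedureReturn Table() ; value of the element that was just found
  EndIf
  ProcedureReturn Word$
EndProcedure

Debug Correct(Table(), "виндам") ; Windows
Debug Correct(Table(), "пурику") ; PureBasic
Debug Correct(Table(), "акел")   ; акел (not in the table)
```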

Re: Stemming (?)

Posted: Fri Mar 01, 2024 10:18 am
by SMaag
Following the link in your first post, I came to the site which lists the stemming algorithm in many languages.

https://tartarus.org/martin/PorterStemm ... x-old.html
Porting this to PB should be no problem for you!

Or are you looking for other functions?