Convert all special chars in a text into regular letters...

wilbert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Convert all special chars in a text into regular letters

Post by wilbert »

Andre wrote:I just need to adapt my project code, because now the converted string is returned instead of the previously used 'in-place' conversion.
I can give you an in-place conversion as well, but in that case you can only do a 1 > 1 conversion.
If you need both 1 > 1 and 1 > 2 conversions in the real world, the solution I offered is better than a 1 > 1 in-place conversion combined with an additional ReplaceString for the 1 > 2 conversions.
Andre wrote:But I will report here, how it works when it's done.
That would be great :)
Would be nice to hear about the performance difference in your real world application.
Windows (x64)
Raspberry Pi OS (Arm64)
Bitblazer
Enthusiast
Posts: 733
Joined: Mon Apr 10, 2017 6:17 pm
Location: Germany

Post by Bitblazer »

Andre wrote:I'm sure I've seen something before, but I couldn't find any useful code example... :oops:

I want to have a small function which is able to convert all special chars occurring in longer text strings into their regular letters.

For example take the text and look for all "special chars" listed in the first line, and convert them into their respective counterparts listed on the second line:

Code:

   ÄÂäãåæçÐÉÈéèïíîñÖöøÞŠšÜüÚúý
   AAaaaacDEEeeiiInOoopSsUeUuy
Of course I can do this using a loop with several ReplaceString() and similar calls, but I hope there is a faster way (usable for long texts / thousands of text strings)...
I don't know how much detail you want, and this can be a rather technical question. Excuse me if I only scratch the surface here; if I should go into more detail about something, just ask :).

OK, you mention long texts and thousands of text strings for this task, and you want the best speed. For example, my CPU has 32 KB of L1 data cache, 256 KB of L2 cache and 6 MB of L3 cache. To simplify, let's say "the closer a cache is to the CPU, the faster it is". If you go beyond the capacity of the CPU caches, performance can be orders of magnitude slower (we are talking about several gigabytes per second versus a few hundred megabytes per second in some cases).

Once the data no longer fits in DRAM, it gets frustratingly slow, because now we are talking about hard disk or SSD performance, which can be down to 110 MB/second for a physical hard disk (and remember that this is still the fastest case; filesystem overhead and physical head movement due to repositioning can easily lower it to 50-60 MB/s).

What I am trying to say is that once your data needs more memory than the CPU caches provide, your processing speed depends so much on the slower memory and hard disk / SSD bandwidth that it hardly matters whether you have super fast, tight code sitting in L1 cache or bloated .NET code that relies on IL interpretation. You will be waiting for the data anyway, so I don't see the point in optimizing the code here. (Of course you could do it on principle, but for practical purposes it won't make much of a difference.)

For academic purposes: a straight LUT (lookup table) approach would be very fast and elegant. The source char is used as an index into the LUT, so the operation is string[x] = LUT[string[x]], which in assembler comes down to two or three machine instructions per character depending on the CPU architecture ("operand memory, memory" is an interesting technical problem / limitation on some architectures). But it would be a pointless optimization with regard to the real-world speed of your character conversion.
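
The LUT idea described above could look roughly like this in PureBasic. This is only a minimal sketch, assuming a Unicode build (2 bytes per character, PB 5.50 or later) and a source file saved as UTF-8; LUT(), InitLUT() and ConvertInPlace() are made-up names and not the ASM routines posted earlier in this thread. It handles 1 > 1 conversions only, because the string is changed in place and cannot grow.

```purebasic
; Illustrative only: LUT(), InitLUT() and ConvertInPlace() are hypothetical names
Global Dim LUT.u(65535)

Procedure InitLUT()
  Protected i, Source$ = "ÄäÖöÜüÉéÈèñç", Target$ = "AaOoUuEeEenc"
  ; Identity mapping first: every character maps to itself
  For i = 0 To 65535
    LUT(i) = i
  Next
  ; Then overwrite the entries of the characters to be replaced
  For i = 1 To Len(Source$)
    LUT(Asc(Mid(Source$, i, 1))) = Asc(Mid(Target$, i, 1))
  Next
EndProcedure

Procedure ConvertInPlace(*Text.Unicode)
  If *Text = 0 : ProcedureReturn : EndIf   ; never-assigned string, nothing to do
  While *Text\u
    *Text\u = LUT(*Text\u)                 ; one table read and one write per character
    *Text + SizeOf(Unicode)                ; advance by one character (2 bytes)
  Wend
EndProcedure

InitLUT()
Define Name$ = "Ärger über Émile"
ConvertInPlace(@Name$)
Debug Name$   ; Arger uber Emile
```

The table costs 128 KB for the full 16-bit range, which is more than the 32 KB data cache mentioned above, so in practice only the part of the table that is actually hit stays cached.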

A few links with further information:
https://en.wikipedia.org/wiki/Cache_per ... and_metric
https://github.com/opcm/pcm
https://en.wikipedia.org/wiki/CPU_cache
DRAM memory bandwidth: https://en.wikipedia.org/wiki/Dynamic_r ... ess_memory

Tools to collect further info:
https://www.aida64.de
https://github.com/opcm/pcm
Andre
PureBasic Team
Posts: 2056
Joined: Fri Apr 25, 2003 6:14 pm
Location: Germany (Saxony, Deutscheinsiedel)

Post by Andre »

wilbert wrote:I can give you an in-place conversion as well, but in that case you can only do a 1 > 1 conversion.
If you need both 1 > 1 and 1 > 2 conversions in the real world, the solution I offered is better than a 1 > 1 in-place conversion combined with an additional ReplaceString for the 1 > 2 conversions.
Considering your last comment, the 'in-place conversion' shouldn't be needed, as the complete conversion of umlauts and other special chars should always be used for all sorting and searching functions.

So I've just started integrating your latest ASM functions (with support for multiple conversion tables) into my project, and I'm adapting all calling routines to the new functions.
I'll keep you informed! 8)

@Bitblazer: thanks for your additional comments / website links - I will take a closer look if I run into problems... :wink:
Bye,
...André
(PureBasicTeam::Docs & Support - PureArea.net | Order:: PureBasic | PureVisionXP)

Post by Andre »

@wilbert:
Today I finished the integration of the latest ASM functions (I'm using ConvertCharsPtr() now) into my project.
It runs very, very fast, even with thousands of data records to convert (for sorting and searching). I haven't done any exact speed tests, as I have also reworked some of my own code.

My search function is restricted to accepting only plain chars and (German) umlauts in search terms (names with special chars will still be found, as they are converted to plain chars when searching...). To convert "only" the umlauts, I've added the following code to the Init_ConversionTables() function:

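Code:

  Global Dim CT_Umlauts.l(65535)
 
  ; Start with an identity mapping: every char converts to itself
  For c = 1 To 65535
    CT_SpecialChars(c) = c
    CT_Lowercase(c) = c
    CT_Uppercase(c) = c
    CT_Umlauts(c) = c
  Next
  
  ; Umlauts (only needed for converting the search terms in the GeoWorld search function;
  ; the complete search strings to compare with are fully converted using CT_SpecialChars())
  
  ; Each original char is followed by a space, so Org and Conv both advance by 2 chars (4 bytes)
  Org  = "Ä Ö Ü ä ö ü ß "
  Conv = "AeOeUeaeoeuess"
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4   ; original char (16 bit)
    c = *Conv\l : *Conv + 4   ; two replacement chars packed into one long
    If c >> 16 = 32           ; second char is a space: 1 > 1 conversion
      c & $ffff               ; keep only the first char
    EndIf
    CT_Umlauts(o) = c
  Wend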
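
The table entries built by the loop above appear to pack up to two output chars into one long: the low 16 bits hold the first char and the high 16 bits an optional second one. The following sketch shows how such an entry can be unpacked again. It only illustrates that layout; ExpandEntry() is a hypothetical helper and not part of wilbert's ConvertCharsPtr().

```purebasic
; Hypothetical helper: turns one packed table entry back into 1 or 2 characters
Procedure.s ExpandEntry(Entry.l)
  Protected Char1 = Entry & $FFFF            ; low word: first output char
  Protected Char2 = (Entry >> 16) & $FFFF    ; high word: optional second output char
  If Char2
    ProcedureReturn Chr(Char1) + Chr(Char2)  ; 1 > 2 conversion, e.g. Ä > Ae
  EndIf
  ProcedureReturn Chr(Char1)                 ; 1 > 1 conversion
EndProcedure

Define Packed.l = Asc("A") | (Asc("e") << 16)
Debug ExpandEntry(Packed)      ; Ae
Debug ExpandEntry(Asc("u"))    ; u
```
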
Only one problem seems to be left:
When I do heavy 'stress tests' with the full-text search, for which all data fields (even larger strings) of an item are combined with + into one large PB string and later completely converted by ConvertCharsPtr(), I run into an 'Invalid Memory access...' error at this line of ConvertCharsPtr():

Code:

  Protected Dim OutputBuffer.l(MemoryStringLength(*String))
It seems to happen especially when I do one full-text search over >80,000 data items (= creating one large string per item and converting all special chars with ConvertCharsPtr()), and after that a simpler search in single data fields (with short strings) over the same number of items...

I have no short test code at the moment... but do you see any way to make the function more 'bullet-proof', or is there something I should be aware of when calling the conversion procedure?
(I already take care to call the conversion function only with strings > 0 bytes, and I wasn't using strings >65 KB before the crash happened.)

Besides that: awesome work, thank you very much! :D

Post by wilbert »

Glad to hear it is working fast Andre :)

As for the Invalid Memory access, what you could do to test is split the line into

Code:

  Protected slen.i = MemoryStringLength(*String)
  Protected Dim OutputBuffer.l(slen)
and see where the error is now.
If it's the MemoryStringLength which is causing problems, the pointer you are passing is probably wrong.
If it's not, check the value MemoryStringLength returns to see if something is strange with this value.

When you do your full-text search, is there any reason you are combining everything into one large string instead of going over all fields one after the other?
I'm also wondering whether the position is relevant to you when you search, or whether you only need to know if a certain string contains the text you are looking for.

Post by Andre »

Thank you, wilbert! :D

I just did further tests and found the following:
The ConvertCharsPtr() function itself doesn't have a problem; it crashes at the call of MemoryStringLength() because the string pointer of an empty string was passed as parameter. It would be easy to catch this by including a test before that call...
But as this would have a noticeable speed impact, such a test should be done before calling ConvertCharsPtr(), and only where really needed.

Using this "workaround" solves my problem.
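
A guard of the kind described above might look like the sketch below. It assumes, as shown later in this thread, that taking the address of a never-assigned string yields a null pointer. SafeLength() is a made-up name; a real call site would wrap the call to ConvertCharsPtr() in the same way.

```purebasic
; Hypothetical wrapper: only touches the memory if the pointer is valid
Procedure SafeLength(*Text)
  If *Text = 0
    ProcedureReturn 0                        ; null pointer: string was never assigned
  EndIf
  ProcedureReturn MemoryStringLength(*Text)  ; length in characters
EndProcedure

Define Unset$               ; declared but never assigned
Define Filled$ = "Köln"

Debug @Unset$               ; 0
Debug SafeLength(@Unset$)   ; 0
Debug SafeLength(@Filled$)  ; 4
```

The check is a single pointer comparison, so whether it is measurable in the real application is something only a test on the actual data can show.
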
But I searched further. The crash only happened when switching from the "single search" (comparing only specific data fields) to my implementation of a "full-text search" and back to a "single search". I noticed that my map (e.g. Cities) with >75,000 elements has grown by one element after the full-text search. This additional element contains only empty structure fields, which finally causes the crash when ConvertCharsPtr() is called on an empty search string...

I can't find a mistake in my code. The only peculiarity of my full-text search is that it's not just a simple loop through all map elements (as the "single search" is), but also uses a pair of Push/PopMapPosition in between. From my point of view this shouldn't enlarge the map itself. (The map also grows by one element only once, even if the full-text search is called several times.)
I need to see whether I can isolate this problem in a small test code. Maybe I've found a bug that shows up in some special circumstances...

About your question about the full-text search:
It's just my implementation of it: "collecting" the content of all available data fields for one element (city, country, etc.) and then comparing this (probably large) string with one or several search terms, according to the selected search mode "contains all", "contains one", "contains none...", etc. As it's working well, I didn't think about changing it... :wink:

Besides that, the single search compares each individual data field if there is a search term for it... (again according to the selected search mode, this time different for text and numeric fields).

If I remember correctly, I modelled this on office programs (e.g. MS Excel) and their search options... :mrgreen:

Post by wilbert »

Andre wrote:I noticed, that my Map of (e.g. Cities) >75,000 elements is increased by one element after the full-text search. And this one additional element contains only empty structure fields, finally causing the crash when calling the ConvertCharsPtr() function on an empty search string...
Maybe you are accidentally assigning something to a map element with an empty string as key?

Code:

NewMap Country.s()
Country("US") = "United States"
Debug MapSize(Country())
Country("") = "empty"
Debug MapSize(Country())
Andre wrote:About your question about the full-text search:
It's just my implementation of it: "collecting" the content of all available data fields for one element (city, country, etc.) and then comparing this (probably large) string with one or several search terms, according to the selected search mode "contains all", "contains one", "contains none...", etc. As it's working well, I didn't think about changing it... :wink:
I asked because combining strings with the + operator can be slow.
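
Going over the fields one after the other, instead of concatenating them first, could be sketched like this for a "contains one" search. The structure, its field names and ContainsTerm() are invented for illustration and do not come from Andre's project.

```purebasic
; Hypothetical record layout, for illustration only
Structure CityRecord
  Name.s
  Country.s
  Region.s
EndStructure

Procedure ContainsTerm(*Rec.CityRecord, Term$)
  ; FindString() returns the 1-based position of the match, or 0 if there is none
  If FindString(*Rec\Name, Term$)    : ProcedureReturn #True : EndIf
  If FindString(*Rec\Country, Term$) : ProcedureReturn #True : EndIf
  If FindString(*Rec\Region, Term$)  : ProcedureReturn #True : EndIf
  ProcedureReturn #False
EndProcedure

Define City.CityRecord
City\Name = "Dresden" : City\Country = "Germany" : City\Region = "Saxony"
Debug ContainsTerm(@City, "Sax")      ; 1
Debug ContainsTerm(@City, "Berlin")   ; 0
```

One difference to the concatenated variant: a search term can no longer match across the boundary between two fields, which is usually the desired behaviour anyway.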

Post by Andre »

wilbert wrote: Maybe you are accidentally assigning something to a map element with an empty string as key?

Code:

NewMap Country.s()
Country("US") = "United States"
Debug MapSize(Country())
Country("") = "empty"
Debug MapSize(Country())
Yes, this seems to be true. I already had this idea when I went to bed last night.
But this evening I'm still debugging to find the piece of code that causes it (I already have an idea...)

Seeing this problem and how hard the bug is to find (I would never have thought that there could be a map element with an empty key), I think a compiler error (or at least a warning) should be raised if the programmer tries to use an empty key...
If it's not a bug in PB (but at least an unwanted feature), I would make a feature request to forbid empty keys in PB maps. What do you think? :D
(I can't imagine that someone really needs this!?)
wilbert wrote:
Andre wrote:About your question about the full-text search:
It's just my implementation of it: "collecting" the content of all available data fields for one element (city, country, etc.) and then comparing this (probably large) string with one or several search terms, according to the selected search mode "contains all", "contains one", "contains none...", etc. As it's working well, I didn't think about changing it... :wink:
I asked because combining strings with the + operator can be slow.
Maybe I'll need to look for a faster variant once the databases have grown. For the moment it's fast enough... :-)

Post by wilbert »

Andre wrote:Seeing this problem and how hard the bug is to find (I would never have thought that there could be a map element with an empty key), I think a compiler error (or at least a warning) should be raised if the programmer tries to use an empty key...
If it's not a bug in PB (but at least an unwanted feature), I would make a feature request to forbid empty keys in PB maps. What do you think? :D
(I can't imagine that someone really needs this!?)
I probably wouldn't use an empty key, but maybe someone else wants to.
An error seems too much, but a warning might be convenient.

Since your problem with the character conversion procedure was related to an uninitialized string (null pointer), it's possible that this is also the cause in this case.
PureBasic allows not only empty strings as key but even uninitialized strings.

Code:

NewMap Country.s()

A.s
Debug @A; null pointer

Country(A) = "My country"
Debug Country("")
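
Until the compiler warns about this, a small checked insert can keep empty or uninitialized keys out of a map. The sketch below is only one way to do it; AddChecked() is a hypothetical helper, and the FindMapElement() line at the end is meant as a quick diagnostic for an existing map.

```purebasic
; Hypothetical helper: refuses empty keys, which also covers never-assigned strings
Procedure AddChecked(Map Target.s(), Key$, Value$)
  If Key$ = ""
    ProcedureReturn #False
  EndIf
  Target(Key$) = Value$
  ProcedureReturn #True
EndProcedure

NewMap Country.s()
Define Unset$   ; declared but never assigned

Debug AddChecked(Country(), "US", "United States")   ; 1
Debug AddChecked(Country(), Unset$, "My country")    ; 0
Debug MapSize(Country())                             ; 1

; Diagnostic: does the map already contain an element with an empty key?
If FindMapElement(Country(), "")
  Debug "map contains an empty key"
EndIf
```
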
Andre wrote:Maybe I'll need to look for a faster variant once the databases have grown. For the moment it's fast enough... :-)
If it's fast enough there's indeed no need to optimize :)

Post by Andre »

Thank you, wilbert, for the further comments. :-)

I've added a feature request here: http://www.purebasic.fr/english/viewtop ... =3&t=70343

Post by Andre »

I just started using PB6.00 alpha - and it's no surprise that it doesn't work "out of the box" with my big project codes (105,000 and 40,000 lines). So I need to test and adapt step by step to handle the needed changes...

One thing I came across is the inline ASM code by 'wilbert' in this thread, which I use heavily for converting strings with umlauts/special chars into simple A-Z chars. It gives a "PureBasic - Assembler error" with the notes "error - unknown type name 'mov'" and "error: expected identifier or '(' before '[' token mov eax, [p.p_ConversionTable]".

Is this something which can be adapted? Or does the C backend mean that inline ASM code can't be used anymore at all?

Thanks for your help! :-)
Keya
Addict
Posts: 1891
Joined: Thu Jun 04, 2015 7:10 am

Post by Keya »

Andre, I don't think inline assembly is currently supported in the PB6 alpha. (I think Fred's trying to get the C part working first!) It seems all "! code" lines are currently sent directly to the C compiler instead of being interpreted as asm. You can still use gcc's inline assembly, though that's pretty "quirky" compared to what we're used to!
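
One way to keep a single source working on both backends is to select the implementation at compile time. The sketch below assumes a PureBasic 6.x compiler that defines the #PB_Compiler_Backend constant; whether the early alpha already provides it is not confirmed here. The Debug lines are placeholders for the real code paths.

```purebasic
; Assumes PureBasic 6.x, where #PB_Compiler_Backend is available
CompilerIf #PB_Compiler_Backend = #PB_Backend_C
  ; C backend: "!" lines go to the C compiler, so classic inline ASM cannot be used here
  Debug "C backend: use a plain PureBasic implementation"
CompilerElse
  ; ASM backend: the inline ASM routines from this thread can stay as they are
  Debug "ASM backend: inline ASM is available"
CompilerEndIf
```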