PureBasic Forums - English

Posted: **Sun Feb 08, 2015 8:55 pm**

I have a program that reads and massages a text file that is written in Esperanto using the cx, gx, method of doing non-English characters. I have the program written so it handles ASCII text without a problem. When I change the text file so that it has the proper Esperanto characters, which can be found in Unicode, it no longer works. How does the program have to be changed to work with Unicode? I have the arrays that are used defined as xxxArray.s(0), and the majority of the variables are .i or .l as the case may be. Also, I can't find anything definite on how to use dynamic arrays, that is, arrays that are defined as xxxArray.s(). Any help would be appreciated.

Norm

Posted: **Sun Feb 08, 2015 9:25 pm**

Is it possible to see any examples of the source code?

You'll only get general answers without being more specific.

Since you are challenged by arrays in addition to unicode can you be more specific on both items?

Posted: **Sun Feb 08, 2015 9:49 pm**

In any case, for properly handling Unicode text, at least 2 settings should be made in the PureBasic IDE

Compiler > Compiler Options ... > [v] Create Unicode executable
File > File format > Encoding: UTF-8
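
A quick way to confirm that the first setting actually took effect is to ask the compiler. This is only a minimal sketch; it relies on the built-in `#PB_Compiler_Unicode` constant and prints a hint in the debug window. The message texts are my own wording.

```purebasic
; Sketch: report at compile time whether "Create Unicode executable" is active.
CompilerIf #PB_Compiler_Unicode
  ; In Unicode mode every character of a string occupies SizeOf(Character) bytes.
  Debug "Unicode executable, bytes per character: " + Str(SizeOf(Character))
CompilerElse
  ; In ASCII mode letters such as the Esperanto accented ones may not be representable.
  Debug "ASCII executable: accented Esperanto letters may get lost in strings."
CompilerEndIf
```

If the second branch is reported, the compiler option was not applied to the source file that is being run.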

Posted: **Sun Feb 08, 2015 10:06 pm**

OK, here's the code. I hope I do this correctly.

Code: Select all

filehandle.i
txt.s
timesThrough.i
Global Dim concord.s(1)
maxRec.i

; Open the file containing the raw data.
filehandle = PB_Any
result = ReadFile(filehandle, "D:\testfile.txt")
If err > 0
  Debug "Error opening file."
  End
EndIf
    
For timesThrough = 1 To 50  ;2336
  txt = ""
  
  ; **** Read each line of raw data.
  If Lof(filehandle) > 0
    txt = ReadString(filehandle)
  EndIf
    
  ; *** Strip off the Bible reference for later use.
  startPos.i = 5
  startPos2.i = FindString(txt, " ", 5)
  
  If Left(txt, 3) = "Mat"
    ref.s = " " + Left(txt, startPos2 - 1) + ","
    ref2.s = " " + Mid(txt, startPos, startPos2 - startPos) + ","
  Else
    ref = ref
    ref2 = ref2
  EndIf
  txt = Mid(txt, startPos2 + 1)
  
  ; **** Parse the line of data into a one-dimensional array
  ; of individual values as delimited by any of several
  ; characters. None of these characters are returned in
  ; the result. Provide a list of delimiters.
  delimitList.s
  oneChar.s
  aWord.s
  codeCount.i
  Dim wordList.s(0);0
  ReDim wordList(0);0
  
  ; Characters recognized as delimiters.
  delimitList = " ,.;:+/!|?()'"+Chr(34)
  
  i.i
  j.i
  k.i
  codeCount = 0
  
  i = Len(txt)
  For j = 1 To i
    ; Read one char at a time.
    oneChar = Mid(txt, j, 1)
    ; Is this one a delimiter?
    k = FindString(delimitList, oneChar)
    ; If it isn't, add to the current word.
    If k = 0
      aWord = aWord + oneChar
    EndIf
    ; If it is, or if we're finished ...
    If k <> 0 Or j = i
      If aWord > ""
        ; Save the word.
        codeCount = codeCount + 1
        ReDim wordList(codeCount)
        wordList(codeCount) = aWord
        aWord = ""
      EndIf
    EndIf
  Next j
  
  ; **** Sort the array into alphabetical order.
  SortArray(wordList(), 2)
  
  ; **** Attach the reference to each word in the array.
  ;For index.i = 1 To ArraySize(wordList())
  ;  wordList(index) + ref
  ;Next index
 
  ; **** Remove the duplicated words from the array
  ; keeping count of the words removed.
  lowbound.i
  upBound.i
  Dim tempArray.s(0);0
  cur.i
  A.i
  B.i
  
  ; Check for empty array.
  If (wordList(0) = "")
    Break
  EndIf
    
  ; We need these often.
  lowBound = 1 ;lbound(wordList)
  upBound = ArraySize(wordList())
  
  ; Reserve check buffer.
  ReDim tempArray(upBound)
  
  ; Set first item.
  cur = lowBound
  tempArray(cur) = wordList(lowBound)
    
  ; Loop through all items.
  For A = lowBound + 1 To upBound
    ; Make a comparison against all items.
    For B = lowBound To cur
      ; If it is a duplicate, exit array.
      If StringByteLength(tempArray(B)) = StringByteLength(wordList(A))
        If wordList(A) = tempArray(B)
          Break
        EndIf
      EndIf
    Next B
    ; Check if the loop was exited; add new item to check buffer if not.
    If B > cur
      cur = B
      tempArray(cur) = wordList(A)
    EndIf
  Next A
  ; Fix size.
  ReDim tempArray(cur)
  ; Copy
  ReDim wordList(cur)
  CopyArray(tempArray(), wordList())
  
  ; **** Sort the new list.
  SortArray(wordList(), 2)
  
  ; **** Append the wordList array onto the concord array.
  If concord(0) = ""
  ;If SafeArrayGetDim(concord) > 0
    ; Merges two arrays.
    ; wordList is appended to the end of concord.
    ; If killSource is set to true, then
    ; wordList is erased following the merge.
    el.i
    pos.i
    temp.i
    uBoundSource.i
    lBoundSource.i
    uBoundDest.i
    killSource.i
    true.i = 1
    killSource = true
    
    If (wordList(1)="") And (concord(1)="")
      Break
    EndIf

    lBoundSource = 1
    uBoundSource = ArraySize(wordList())
    uBoundDest = ArraySize(concord())
    temp = uBoundSource - lBoundSource + 1
    
    pos = ArraySize(concord()) + 1
    ReDim concord(ArraySize(concord()) + temp)
    For el = uBoundDest To pos Step -1
      concord(el + temp) = concord(el)
    Next el
    uBoundSource = pos + temp - 1
    For el = pos To uBoundSource
      concord(el) = wordList(el - pos)
    Next el
    If killSource = true 
      FreeArray(wordList())
    EndIf
  Else
    ReDim concord(ArraySize(wordList()))
    CopyArray(wordList(), concord())
  EndIf
  
  ; **** Sort the concord array.
  SortArray(concord(), 2)
    
  ; **** Remove the duplicated words from the array
  ; keeping count of the words removed.
  
  ; Check for empty array.
  ;If (concord(1) = "")
  ;  Break
  ;EndIf
    
  ; We need these often.
  lowBound = 0
  upBound = ArraySize(concord())
  
  ; Reserve check buffer.
  ReDim tempArray(upBound)
  
  ; Set first item.
  cur = lowBound
  tempArray(cur) = concord(lowBound)
  
  ; Loop through all items.
  For A = lowBound + 1 To upBound
    ; Make a comparison against all items.
    For B = lowBound To cur
      ; If it is a duplicate, exit array.
      If StringByteLength(tempArray(B)) = StringByteLength(concord(A))
        If concord(A) = tempArray(B)
          Break
        EndIf
      EndIf
    Next B
    ; Check if the loop was exited; add new item to check buffer if not.
    If B > cur
      cur = B
      tempArray(cur) = concord(A)
    EndIf
  Next A
  ; Fix size.
  ReDim tempArray(cur)
  ; Copy
  ReDim concord(cur)
  CopyArray(tempArray(), concord())
  ;SortArray(concord(), 2)

Next timesThrough

; **** Finally, sort and print the concord array.
SortArray(concord(), 2)

For index = 1 To ArraySize(concord())
  If concord(index) > "@"
    Debug(concord(index))
  EndIf
Next index
Debug Str(ArraySize(concord()))
Debug ref

End

The output of the program with a Unicode textfile was a single digit "1" in the debug window.
It should have been a list of words in alphabetical order with no duplicated words.

Posted: **Sun Feb 08, 2015 10:07 pm**

Thanks, Little John. The changes have been made, still doesn't work.

Posted: **Sun Feb 08, 2015 10:16 pm**

By the way, this is the output I'm trying to get (the program isn't complete, yet):

Abija Mat 1:7(2),
Abiud Mat 1:13(2),
Abraham Mat 1:1, 1:2, 1:17,
adorklinig^i Mat 2:2, 2:8,
adorklinig^is Mat 2:11,
Ahaz Mat 1:9(2),
Ah^im Mat 1:14(2),
Akridoj Mat. 3.4,
al Mat 1:18, 1:20, 2:4, 2:5, 2:8, 2:11, 2:12(3), 2:13(2),
.
.
.
etc.

The ^ should be part of the preceding letter, but I don't know how to find that letter in this venue.

Norm

Posted: **Sun Feb 08, 2015 10:27 pm**

Thanks for the code sample. Can you also provide a partial list of the contents of the text file, perhaps 30 lines or so?

That should be all that is needed to give you a detailed answer.

Posted: **Sun Feb 08, 2015 10:57 pm**

Sure, Demivec, here's the first chapter of Matthew.

Code: Select all

Mat 1:1 La libro de la genealogio de Jesuo Kristo, filo de David, filo de Abraham.
Mat 1:2 Al Abraham naskigis Isaak, kaj al Isaak naskigis Jakob, kaj al Jakob naskigis Jehuda kaj liaj fratoj,
Mat 1:3 kaj al Jehuda naskigis Perec kaj Zeraĥ el Tamar, kaj al Perec naskigis Ĥecron, kaj al Ĥecron naskigis Ram,
Mat 1:4 kaj al Ram naskigis Aminadab, kaj al Aminadab naskigis Naĥŝon, kaj al Naĥŝon naskigis Salma,
Mat 1:5 kaj al Salma naskigis Boaz el Raĥab, kaj al Boaz naskigis Obed el Rut, kaj al Obed naskigis Jiŝaj,
Mat 1:6 kaj al Jiŝaj naskigis David, la reĝo. Kaj al David la reĝo naskigis Salomono el [la] [edzino] de Urija,
Mat 1:7 kaj al Salomono naskigis Reĥabeam, kaj al Reĥabeam naskigis Abija, kaj al Abija naskigis Asa,
Mat 1:8 kaj al Asa naskigis Jehoŝafat, kaj al Jehoŝafat naskigis Joram, kaj al Joram naskigis Uzija,
Mat 1:9 kaj al Uzija naskigis Jotam, kaj al Jotam naskigis Aĥaz, kaj al Aĥaz naskigis Ĥizkija,
Mat 1:10 kaj al Ĥizkija naskigis Manase, kaj al Manase naskigis Amon, kaj al Amon naskigis Joŝija,
Mat 1:11 kaj al Joŝija naskigis Jeĥonja kaj liaj fratoj, je la tempo de la deporto en Babelon.
Mat 1:12 Kaj post la deporto en Babelon, al Jeĥonja naskigis Ŝealtiel, kaj al Ŝealtiel naskigis Zerubabel,
Mat 1:13 kaj al Zerubabel naskigis Abiud, kaj al Abiud naskigis Eljakim, kaj al Eljakim naskigis Azor,
Mat 1:14 kaj al Azor naskigis Cadok, kaj al Cadok naskigis Aĥim, kaj al Aĥim naskigis Eliud,
Mat 1:15 kaj al Eliud naskigis Eleazar, kaj al Eleazar naskigis Mattan, kaj al Mattan naskigis Jakob,
Mat 1:16 kaj al Jakob naskigis Jozef, edzo de Maria, el kiu estis naskita Jesuo, kiu estas nomata Kristo.
Mat 1:17 Tial ĉiuj generacioj de Abraham ĝis David [estis] dek kvar generacioj, kaj de David ĝis la deporto en Babelon [estis] dek kvar generacioj, kaj de la deporto en Babelon ĝis la Kristo [estis] dek kvar generacioj.
Mat 1:18 Nun la naskiĝo de Jesuo Kristo estis tiel: Ĉar lia patrino Maria estis fianĉinigita al Jozef, antaŭ ol ili kunvenis, ŝi troviĝis havanta [bebo] en utero per la Sankta Spirito. 
Mat 1:19 Sed Jozef, ŝia edzo, estante justa, kaj ne volante elmeti ŝin publike, konsideris ŝin sekrete forsendi.
Mat 1:20 Kaj [kiam] li pripensis tion, jen anĝelo de [la] Sinjoro aperis al li en sonĝo, dirante, "Jozef, filo de David, ne timu preni Maria [kiel] via edzino, ĉar la [bebo] en ŝi, estis koncipiĝita per la Sankta Spirito.
Mat 1:21 Kaj ŝi naskos filon; kaj vi vokos lian nomon Jesuo; ĉar li savos sian popolon de ĝiaj pekoj."
Mat 1:22 Nun ĉio [tio] okazis, por ke plenumiĝu la [vorto] la Sinjoro parolis per la profeto, dirante:
Mat 1:23 Jen virgulino havos [bebo] en utero kaj naskos filon, Kaj oni vokos lian nomon Emanuel, tio estas, tradukata, "Dio kun ni."
Mat 1:24 Kaj Jozef, estas vekita de [lia] dormo, faris kiel ordonis lin [per] la anĝelo de [la] Sinjoro, kaj li prenis sian edzinon,
Mat 1:25 kaj ne konis ŝin, ĝis ŝi naskis ŝian filon [la]  unuenaskitan; kaj li vokis lian nomon Jesuo.

As you can see, what I am trying to do is build a concordance.

Posted: **Mon Feb 09, 2015 12:31 am**

Hello NormJ,

unfortunately, I don't have the time to look carefully at all your code.
It certainly needs some reworking.

Just a few notes:

Code: Select all

; Open the file containing the raw data.
filehandle = PB_Any
result = ReadFile(filehandle, "D:\testfile.txt")
If err > 0
   Debug "Error opening file."
   End
EndIf

How should the variable err become > 0 here?
An error with ReadFile() is indicated by a return value of 0 (as documented in the help for ReadFile()).
So that code snippet should look like this:

Code: Select all

; Open the file containing the raw data.
filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
   Debug "Error opening file."
   End
EndIf

Also replace

Code: Select all

   If Lof(filehandle) > 0
      txt = ReadString(filehandle)
   EndIf

with

Code: Select all

   If Eof(filehandle)
      Break
   EndIf   
   
   txt = ReadString(filehandle, #PB_UTF8)

and be sure that the file which contains the text is saved in UTF-8 format.

Posted: **Mon Feb 09, 2015 3:56 am**

That seems like a lot of effort to strip and remove duplicates. A Trie would be a better structure
for the job you're doing:
http://en.wikipedia.org/wiki/Trie

Regarding the Unicode issue, use single-quoted characters instead of Chr(34),
like this:

Code: Select all

mystring.s = "a b" 

For a = 1 To Len(mystring) 
  c.s = Mid(mystring,a,1)
  If Chr(' ') = c   ; <---- single-quoted character constant
    Debug "space"
  EndIf 
Next

Posted: **Mon Feb 09, 2015 7:22 am**

If you are open to suggestions ...
Consider a Map for building the concordance. It's easier in this case compared to an array.
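
To make the Map suggestion a bit more concrete, here is a rough sketch, not a finished concordance. The names `concord()`, `words()` and `AddWord()` are made up for illustration, and the three sample calls stand in for the words the parsing loop would deliver. Because the word is the key, a repeated word simply extends the existing entry, so no separate duplicate removal is needed.

```purebasic
; Sketch only: key = word, value = the references collected so far.
Procedure AddWord(Map entries.s(), word.s, ref.s)
  ; Accessing a key that does not exist yet creates it with an empty string.
  entries(word) = entries(word) + " " + ref + ","
EndProcedure

NewMap concord.s()

; Sample data standing in for the words parsed from the text file.
AddWord(concord(), "Abraham", "Mat 1:1")
AddWord(concord(), "al", "Mat 1:2")
AddWord(concord(), "Abraham", "Mat 1:2")

; A map has no defined order, so copy the keys into a list and sort that.
NewList words.s()
ForEach concord()
  AddElement(words())
  words() = MapKey(concord())
Next
SortList(words(), #PB_Sort_Ascending | #PB_Sort_NoCase)

; Print each word followed by its references.
ForEach words()
  Debug words() + concord(words())
Next
```

One thing to check with real data: as far as I know the sort compares character values, so the accented Esperanto letters will probably end up after "z" rather than in their proper alphabetical position.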

Posted: **Mon Feb 09, 2015 7:59 am**

@NormJ: There are several issues with the program. I am aware that it is a work in progress, so I will only touch on two of the issues you mentioned.

The first is that the text file you are reading is probably not in ASCII and is in either Unicode or UTF8. It probably contains a BOM (Byte Order Mark) as the first few bytes. You will need to check for this first so that you can skip past it to the actual text data and to also make sure you are reading that data in the proper format.

When you were using an ASCII file for the Esperanto text before you used the Cx Gx method to mark the special characters. Now you would be using either Unicode or UTF8 to mark the actual character values instead of a substitution code. This means it is important to get this right otherwise your characterwise comparisons later will be off by a few bytes right from the start.

Here is some code that I used near the beginning of your code to handle this, along with some things that Little John mentioned also:

Code: Select all

filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
  Debug "Error opening file."
  End
EndIf

If Not Eof(filehandle)
  bom = ReadStringFormat(filehandle)
  Select bom
    Case #PB_Ascii, #PB_UTF8, #PB_Unicode ;no problem
    Case #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
      MessageRequester("Error", "Text file is in a text format that cannot be read.")
      End
  EndSelect
  
EndIf

For timesThrough = 1 To 50  ;2336
  txt = ""

;code continues from here
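
For completeness, this is roughly how the detected format can then be used for reading the lines. It is a self-contained sketch under one assumption of mine: a file without a BOM (reported as #PB_Ascii) is treated as UTF-8, which is harmless for pure ASCII text. The variable name `format` is also just my choice.

```purebasic
; Sketch: read every line in the encoding reported by ReadStringFormat().
filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
  Debug "Error opening file."
  End
EndIf

; ReadStringFormat() checks for a BOM and moves the file pointer past it.
format = ReadStringFormat(filehandle)
If format <> #PB_Ascii And format <> #PB_UTF8 And format <> #PB_Unicode
  Debug "Text file is in a format that cannot be read."
  CloseFile(filehandle)
  End
EndIf

; Assumption: no BOM found means the file was saved as UTF-8 without a BOM.
If format = #PB_Ascii
  format = #PB_UTF8
EndIf

While Eof(filehandle) = 0
  txt.s = ReadString(filehandle, format)
  Debug txt   ; the parsing code would go here instead
Wend

CloseFile(filehandle)
```

Reading until Eof() also removes the need for the fixed `For timesThrough = 1 To 50` loop count.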

The second issue that you are running into is that in PureBasic an array's indexes run from 0 -> DIM_Size.

Because you are storing strings in the arrays and because you begin by storing things at index #1, the first index (#0) is an empty string. When you later sort the array in ascending order any empty strings will be placed at the lowest indexes. You then look at index #0 to see if it is an empty string and conclude that the entire array has nothing in it as a result.

You need to start filling the array at index #0 and continue filling up to and including the highest index.

This would mean that code such as:

Code: Select all

 ; If it is, or if we're finished ...
    If k <> 0 Or j = i
      If aWord > ""
        ; Save the word.
        codeCount = codeCount + 1
        ReDim wordList(codeCount)
        wordList(codeCount) = aWord
        aWord = ""
      EndIf
    EndIf

would be rewritten as:

Code: Select all

; If it is, or if we're finished ...
    If k <> 0 Or j = i
      If aWord > ""
        ; Save the word.
        ReDim wordList(codeCount)
        wordList(codeCount) = aWord
        codeCount = codeCount + 1
        aWord = ""
      EndIf
    EndIf

I also agree with wilbert that it would be easier to use a Map instead of an array to build the concordance.

I noticed also that you will run into at least one tricky situation when dealing with the lettercase of things in your concordance. You will need to take some steps to deal with things like 'Libro' and 'libro' each having separate concordance entries if that is considered undesirable. Otherwise both entries would have to be referenced when looking for text.
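
One possible way to handle the lettercase problem, shown here with a Map: use the lower-cased word as the key. This is only a sketch; `AddWordNoCase()` is a made-up name and the second reference is a placeholder, not a real verse. It is also worth verifying that LCase() converts the accented Esperanto capitals correctly on your PureBasic version.

```purebasic
; Sketch: fold case so that 'Libro' and 'libro' share one entry.
Procedure AddWordNoCase(Map entries.s(), word.s, ref.s)
  Protected key.s = LCase(word)   ; the lower-cased word is the map key
  entries(key) = entries(key) + " " + ref + ","
EndProcedure

NewMap concord.s()
AddWordNoCase(concord(), "Libro", "Mat 1:1")
AddWordNoCase(concord(), "libro", "Mat x:y")   ; placeholder reference

Debug MapSize(concord())   ; number of distinct entries after case folding
Debug concord("libro")     ; references collected for both spellings
```

The trade-off is that proper names such as 'Abraham' are folded to lower case as well, so the original spelling would have to be stored separately if it should appear in the output.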

Posted: **Mon Feb 09, 2015 6:37 pm**

Gentlemen,

Thank you all for your comments. Demivec and Little John, I am considering your corrections, and will probably use them. Thanks.

idle, I do not know how to use a Map, but I will read up on it, and maybe that is the best way to go.

Is there other documentation available for PureBasic?
I have kale's book, as well as Krylar's book, and the official reference for 5.30, but none of them go into as much detail as I would like
and am used to, from using a lot of the MS documentation for VB. 8-\

Norm

Posted: **Mon Feb 09, 2015 7:29 pm**

NormJ, you are welcome!
I hope you can make your code work like you want. If not, please don't hesitate to ask here again for help.

If you want to do serious programming, IMHO it is sooner or later recommended to get one or more good book(s) about algorithms and data structures, e.g. CLRS. However, I don't know whether something like that is actually what you're looking for.

Oh, by the way, here is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

Posted: **Mon Feb 09, 2015 8:16 pm**

Apologies for hijacking this topic a bit, but I'm also quite a noob on this ...
* Are there rules (or tools) to check whether a source needs Unicode or not?
* Will the use of FindString, StringField, ExtractRegularExpression etc. in my sources (which already work fine with ASCII only) behave differently when they are used with Unicode files?
* Will the settings quoted below make a difference even if no Unicode is in use?

In any case, for properly handling Unicode text, at least 2 settings should be made in the PureBasic IDE
Compiler > Compiler Options ... > [v] Create Unicode executable
File > File format > Encoding: UTF-8

Thanks.

Difference between ASCII and Unicode
