Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 8:55 pm
by NormJ
I have a program that reads and massages a text file that is written in Esperanto using the cx, gx method of representing non-English characters. I have the program written so it handles ASCII text without a problem. When I change the text file so that it has the proper Esperanto characters, which can be found in Unicode, it no longer works. How does the program have to be changed to work with Unicode? I have the arrays that are used defined as xxxArray.s(0), and the majority of the variables are .i or .l as the case may be. Also, I can't find anything definite on how to use dynamic arrays, that is, arrays that are defined as xxxArray.s(). Any help would be appreciated.
Norm
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 9:25 pm
by Demivec
Is it possible to see any examples of the source code?
You'll only get general answers without being more specific.
Since you are challenged by arrays in addition to unicode can you be more specific on both items?
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 9:49 pm
by Little John
In any case, for properly handling Unicode text, at least 2 settings should be made in the PureBasic IDE:
- Compiler > Compiler Options ... > [v] Create Unicode executable
- File > File format > Encoding: UTF-8
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 10:06 pm
by NormJ
OK, here's the code. I hope I do this correctly.
Code:
filehandle.i
txt.s
timesThrough.i
Global Dim concord.s(1)
maxRec.i
; Open the file containing the raw data.
filehandle = PB_Any
result = ReadFile(filehandle, "D:\testfile.txt")
If err > 0
Debug "Error opening file."
End
EndIf
For timesThrough = 1 To 50 ;2336
txt = ""
; **** Read each line of raw data.
If Lof(filehandle) > 0
txt = ReadString(filehandle)
EndIf
; *** Strip off the Bible reference for later use.
startPos.i = 5
startPos2.i = FindString(txt, " ", 5)
If Left(txt, 3) = "Mat"
ref.s = " " + Left(txt, startPos2 - 1) + ","
ref2.s = " " + Mid(txt, startPos, startPos2 - startPos) + ","
Else
ref = ref
ref2 = ref2
EndIf
txt = Mid(txt, startPos2 + 1)
; **** Parse the line of data into a one-dimensional array
; of individual values as delimited by any of several
; characters. None of these characters are returned in
; the result. Provide a list of delimiters.
delimitList.s
oneChar.s
aWord.s
codeCount.i
Dim wordList.s(0);0
ReDim wordList(0);0
; Characters recognized as delimiters.
delimitList = " ,.;:+/!|?()'"+Chr(34)
i.i
j.i
k.i
codeCount = 0
i = Len(txt)
For j = 1 To i
; Read one char at a time.
oneChar = Mid(txt, j, 1)
; Is this one a delimiter?
k = FindString(delimitList, oneChar)
; If it isn't, add to the current word.
If k = 0
aWord = aWord + oneChar
EndIf
; If it is, or if we're finished ...
If k <> 0 Or j = i
If aWord > ""
; Save the word.
codeCount = codeCount + 1
ReDim wordList(codeCount)
wordList(codeCount) = aWord
aWord = ""
EndIf
EndIf
Next j
; **** Sort the array into alphabetical order.
SortArray(wordList(), 2)
; **** Attach the reference to each word in the array.
;For index.i = 1 To ArraySize(wordList())
; wordList(index) + ref
;Next index
; **** Remove the duplicated words from the array
; keeping count of the words removed.
lowbound.i
upBound.i
Dim tempArray.s(0);0
cur.i
A.i
B.i
; Check for empty array.
If (wordList(0) = "")
Break
EndIf
; We need these often.
lowBound = 1 ;lbound(wordList)
upBound = ArraySize(wordList())
; Reserve check buffer.
ReDim tempArray(upBound)
; Set first item.
cur = lowBound
tempArray(cur) = wordList(lowBound)
; Loop through all items.
For A = lowBound + 1 To upBound
; Make a comparison against all items.
For B = lowBound To cur
; If it is a duplicate, exit array.
If StringByteLength(tempArray(B)) = StringByteLength(wordList(A))
If wordList(A) = tempArray(B)
Break
EndIf
EndIf
Next B
; Check if the loop was exited; add new item to check buffer if not.
If B > cur
cur = B
tempArray(cur) = wordList(A)
EndIf
Next A
; Fix size.
ReDim tempArray(cur)
; Copy
ReDim wordList(cur)
CopyArray(tempArray(), wordList())
; **** Sort the new list.
SortArray(wordList(), 2)
; **** Append the wordList array onto the concord array.
If concord(0) = ""
;If SafeArrayGetDim(concord) > 0
; Merges two arrays.
; wordList is appended to the end of concord.
; If killSource is set to true, then
; wordList is erased following the merge.
el.i
pos.i
temp.i
uBoundSource.i
lBoundSource.i
uBoundDest.i
killSource.i
true.i = 1
killSource = true
If (wordList(1)="") And (concord(1)="")
Break
EndIf
lBoundSource = 1
uBoundSource = ArraySize(wordList())
uBoundDest = ArraySize(concord())
temp = uBoundSource - lBoundSource + 1
pos = ArraySize(concord()) + 1
ReDim concord(ArraySize(concord()) + temp)
For el = uBoundDest To pos Step -1
concord(el + temp) = concord(el)
Next el
uBoundSource = pos + temp - 1
For el = pos To uBoundSource
concord(el) = wordList(el - pos)
Next el
If killSource = true
FreeArray(wordList())
EndIf
Else
ReDim concord(ArraySize(wordList()))
CopyArray(wordList(), concord())
EndIf
; **** Sort the concord array.
SortArray(concord(), 2)
; **** Remove the duplicated words from the array
; keeping count of the words removed.
; Check for empty array.
;If (concord(1) = "")
; Break
;EndIf
; We need these often.
lowBound = 0
upBound = ArraySize(concord())
; Reserve check buffer.
ReDim tempArray(upBound)
; Set first item.
cur = lowBound
tempArray(cur) = concord(lowBound)
; Loop through all items.
For A = lowBound + 1 To upBound
; Make a comparison against all items.
For B = lowBound To cur
; If it is a duplicate, exit array.
If StringByteLength(tempArray(B)) = StringByteLength(concord(A))
If concord(A) = tempArray(B)
Break
EndIf
EndIf
Next B
; Check if the loop was exited; add new item to check buffer if not.
If B > cur
cur = B
tempArray(cur) = concord(A)
EndIf
Next A
; Fix size.
ReDim tempArray(cur)
; Copy
ReDim concord(cur)
CopyArray(tempArray(), concord())
;SortArray(concord(), 2)
Next timesThrough
; **** Finally, sort and print the concord array.
SortArray(concord(), 2)
For index = 1 To ArraySize(concord())
If concord(index) > "@"
Debug(concord(index))
EndIf
Next index
Debug Str(ArraySize(concord()))
Debug ref
End
The output of the program with a Unicode textfile was a single digit "1" in the debug window.
It should have been a list of words in alphabetical order with no duplicated words.
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 10:07 pm
by NormJ
Thanks, Little John. The changes have been made, but it still doesn't work.
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 10:16 pm
by NormJ
By the way, this is the output I'm trying to get (the program isn't complete, yet):
Abija Mat 1:7(2),
Abiud Mat 1:13(2),
Abraham Mat 1:1, 1:2, 1:17,
adorklinig^i Mat 2:2, 2:8,
adorklinig^is Mat 2:11,
Ahaz Mat 1:9(2),
Ah^im Mat 1:14(2),
Akridoj Mat. 3.4,
al Mat 1:18, 1:20, 2:4, 2:5, 2:8, 2:11, 2:12(3), 2:13(2),
.
.
.
etc.
The ^ should be part of the preceding letter, but I don't know how to find that letter in this venue.
Norm
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 10:27 pm
by Demivec
Thanks for the code sample. Can you also provide a partial list of the contents of the text file, perhaps 30 lines or so?
That should be all that is needed to give you a detailed answer.
Re: Difference between ASCII and Unicode
Posted: Sun Feb 08, 2015 10:57 pm
by NormJ
Sure, Demivec, here's the first chapter of Matthew.
Code:
Mat 1:1 La libro de la genealogio de Jesuo Kristo, filo de David, filo de Abraham.
Mat 1:2 Al Abraham naskigis Isaak, kaj al Isaak naskigis Jakob, kaj al Jakob naskigis Jehuda kaj liaj fratoj,
Mat 1:3 kaj al Jehuda naskigis Perec kaj Zeraĥ el Tamar, kaj al Perec naskigis Ĥecron, kaj al Ĥecron naskigis Ram,
Mat 1:4 kaj al Ram naskigis Aminadab, kaj al Aminadab naskigis Naĥŝon, kaj al Naĥŝon naskigis Salma,
Mat 1:5 kaj al Salma naskigis Boaz el Raĥab, kaj al Boaz naskigis Obed el Rut, kaj al Obed naskigis Jiŝaj,
Mat 1:6 kaj al Jiŝaj naskigis David, la reĝo. Kaj al David la reĝo naskigis Salomono el [la] [edzino] de Urija,
Mat 1:7 kaj al Salomono naskigis Reĥabeam, kaj al Reĥabeam naskigis Abija, kaj al Abija naskigis Asa,
Mat 1:8 kaj al Asa naskigis Jehoŝafat, kaj al Jehoŝafat naskigis Joram, kaj al Joram naskigis Uzija,
Mat 1:9 kaj al Uzija naskigis Jotam, kaj al Jotam naskigis Aĥaz, kaj al Aĥaz naskigis Ĥizkija,
Mat 1:10 kaj al Ĥizkija naskigis Manase, kaj al Manase naskigis Amon, kaj al Amon naskigis Joŝija,
Mat 1:11 kaj al Joŝija naskigis Jeĥonja kaj liaj fratoj, je la tempo de la deporto en Babelon.
Mat 1:12 Kaj post la deporto en Babelon, al Jeĥonja naskigis Ŝealtiel, kaj al Ŝealtiel naskigis Zerubabel,
Mat 1:13 kaj al Zerubabel naskigis Abiud, kaj al Abiud naskigis Eljakim, kaj al Eljakim naskigis Azor,
Mat 1:14 kaj al Azor naskigis Cadok, kaj al Cadok naskigis Aĥim, kaj al Aĥim naskigis Eliud,
Mat 1:15 kaj al Eliud naskigis Eleazar, kaj al Eleazar naskigis Mattan, kaj al Mattan naskigis Jakob,
Mat 1:16 kaj al Jakob naskigis Jozef, edzo de Maria, el kiu estis naskita Jesuo, kiu estas nomata Kristo.
Mat 1:17 Tial ĉiuj generacioj de Abraham ĝis David [estis] dek kvar generacioj, kaj de David ĝis la deporto en Babelon [estis] dek kvar generacioj, kaj de la deporto en Babelon ĝis la Kristo [estis] dek kvar generacioj.
Mat 1:18 Nun la naskiĝo de Jesuo Kristo estis tiel: Ĉar lia patrino Maria estis fianĉinigita al Jozef, antaŭ ol ili kunvenis, ŝi troviĝis havanta [bebo] en utero per la Sankta Spirito.
Mat 1:19 Sed Jozef, ŝia edzo, estante justa, kaj ne volante elmeti ŝin publike, konsideris ŝin sekrete forsendi.
Mat 1:20 Kaj [kiam] li pripensis tion, jen anĝelo de [la] Sinjoro aperis al li en sonĝo, dirante, "Jozef, filo de David, ne timu preni Maria [kiel] via edzino, ĉar la [bebo] en ŝi, estis koncipiĝita per la Sankta Spirito.
Mat 1:21 Kaj ŝi naskos filon; kaj vi vokos lian nomon Jesuo; ĉar li savos sian popolon de ĝiaj pekoj."
Mat 1:22 Nun ĉio [tio] okazis, por ke plenumiĝu la [vorto] la Sinjoro parolis per la profeto, dirante:
Mat 1:23 Jen virgulino havos [bebo] en utero kaj naskos filon, Kaj oni vokos lian nomon Emanuel, tio estas, tradukata, "Dio kun ni."
Mat 1:24 Kaj Jozef, estas vekita de [lia] dormo, faris kiel ordonis lin [per] la anĝelo de [la] Sinjoro, kaj li prenis sian edzinon,
Mat 1:25 kaj ne konis ŝin, ĝis ŝi naskis ŝian filon [la] unuenaskitan; kaj li vokis lian nomon Jesuo.
As you can see, what I am trying to do is build a concordance.
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 12:31 am
by Little John
Hello NormJ,
unfortunately, I don't have the time to look carefully at all your code.
It certainly needs some reworking.
Just a few notes:
Code:
; Open the file containing the raw data.
filehandle = PB_Any
result = ReadFile(filehandle, "D:\testfile.txt")
If err > 0
Debug "Error opening file."
End
EndIf
How should the variable err become > 0 here?
An error with ReadFile() is indicated by a return value of 0 (as documented in the help for ReadFile()).
So that code snippet should look like this:
Code:
; Open the file containing the raw data.
filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
Debug "Error opening file."
End
EndIf
Also replace
Code:
If Lof(filehandle) > 0
txt = ReadString(filehandle)
EndIf
with
Code:
If Eof(filehandle)
Break
EndIf
txt = ReadString(filehandle, #PB_UTF8)
and be sure that the file which contains the text is saved in UTF-8 format.
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 3:56 am
by idle
That seems like a lot of effort to strip and remove duplicates; a Trie would be the better structure for the job you're doing.
http://en.wikipedia.org/wiki/Trie
Regarding the Unicode issue, use single-quoted characters instead of Chr(34),
like this:
Code:
mystring.s = "a b"
For a = 1 To Len(mystring)
c.s = Mid(mystring,a,1)
If Chr(' ') = c ; <----
Debug "space"
EndIf
Next
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 7:22 am
by wilbert
If you are open to suggestions ...
Consider a Map for building the concordance. It's easier in this case compared to an array.
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 7:59 am
by Demivec
@NormJ: There are several issues with the program. I am aware that it is a work in progress, so I will only touch on the two that you mentioned.
The first is that the text file you are reading is probably not in ASCII and is in either Unicode or UTF8. It probably contains a BOM (Byte Order Mark) as the first few bytes. You will need to check for this first so that you can skip past it to the actual text data and to also make sure you are reading that data in the proper format.
When you were using an ASCII file for the Esperanto text before you used the Cx Gx method to mark the special characters. Now you would be using either Unicode or UTF8 to mark the actual character values instead of a substitution code. This means it is important to get this right otherwise your characterwise comparisons later will be off by a few bytes right from the start.
Here is some code that I used near the beginning of your code to handle this, along with some things that Little John mentioned also:
Code:
filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
Debug "Error opening file."
End
EndIf
If Not Eof(filehandle)
bom = ReadStringFormat(filehandle)
Select bom
Case #PB_Ascii, #PB_UTF8, #PB_Unicode ;no problem
Case #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
MessageRequester("Error", "Text file is in a text format that cannot be read.")
End
EndSelect
EndIf
For timesThrough = 1 To 50 ;2336
txt = ""
;code continues from here
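One possible way to carry on from the BOM check, as a minimal sketch: keep the value that ReadStringFormat() returned and pass it to ReadString() as the format flag. The fallback to UTF-8 when no BOM is found is an assumption, based on Little John's advice to save the file as UTF-8; the variable name format is only for illustration.

```purebasic
filehandle = ReadFile(#PB_Any, "D:\testfile.txt")
If filehandle = 0
  Debug "Error opening file."
  End
EndIf

; ReadStringFormat() reports the encoding implied by the BOM (if any)
; and moves the file pointer past it.
format = ReadStringFormat(filehandle)
Select format
  Case #PB_Ascii
    ; No BOM found. Assumption: the file was saved as UTF-8 without a BOM.
    format = #PB_UTF8
  Case #PB_UTF8, #PB_Unicode
    ; ReadString() can read these directly.
  Default
    Debug "Text file is in a format that ReadString() cannot read."
    CloseFile(filehandle)
    End
EndSelect

; Read every line using the detected format.
While Eof(filehandle) = 0
  txt.s = ReadString(filehandle, format)
  Debug txt
Wend
CloseFile(filehandle)
```

Plain ASCII text is also valid UTF-8, so the fallback should not hurt the old cx/gx files. The program still has to be compiled as a Unicode executable, otherwise the Esperanto letters cannot be held in the strings.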
The second issue that you are running into is that in PureBasic an array's indexes run from 0 -> DIM_Size.
Because you are storing strings in the arrays and because you begin by storing things at index #1, the first index (#0) is an empty string. When you later sort the array in ascending order any empty strings will be placed at the lowest indexes. You then look at index #0 to see if it is an empty string and conclude that the entire array has nothing in it as a result.
You need to start filling the array at index #0 and continue filling up to and including the highest index.
This would mean that code such as:
Code:
; If it is, or if we're finished ...
If k <> 0 Or j = i
If aWord > ""
; Save the word.
codeCount = codeCount + 1
ReDim wordList(codeCount)
wordList(codeCount) = aWord
aWord = ""
EndIf
EndIf
would be rewritten as:
Code:
; If it is, or if we're finished ...
If k <> 0 Or j = i
If aWord > ""
; Save the word.
ReDim wordList(codeCount)
wordList(codeCount) = aWord
codeCount = codeCount + 1
aWord = ""
EndIf
EndIf
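To illustrate the indexing point, here is a small stand-alone sketch (the array name words and its contents are made up for the demonstration). It leaves index 0 unassigned, the way the original program does, and then sorts:

```purebasic
Dim words.s(3)                 ; four elements: indexes 0, 1, 2 and 3
Debug ArraySize(words())       ; highest index, not the element count

; Fill from index 1 and leave index 0 untouched.
words(1) = "kaj"
words(2) = "al"
words(3) = "libro"

SortArray(words(), #PB_Sort_Ascending)

; The unassigned element is an empty string, and it sorts to the front,
; so a test like "If words(0) = ''" now looks true even though
; the array holds three words.
If words(0) = ""
  Debug "index 0 is empty after sorting"
EndIf
```

Filling from index 0, as in the rewritten snippet above, avoids the stray empty element altogether.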
I also agree with wilbert that it would be easier to use a Map instead of an array to build the concordance.
I noticed also that you will run into at least one tricky situation when dealing with the lettercase of things in your concordance. You will need to take some steps to deal with things like 'Libro' and 'libro' each having separate concordance entries if that is considered undesirable. Otherwise both entries would have to be referenced when looking for text.
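As a rough sketch of the Map idea combined with the lettercase point: the word, folded to lower case, is the map key, and the references are appended to the value. The procedure name AddWord and the sample words and references are hypothetical, and folding everything to lower case is only one option (it also lowers proper names such as Abraham).

```purebasic
; Key = the word in lower case, value = the references collected so far.
NewMap concord.s()

Procedure AddWord(Map concord.s(), word.s, ref.s)
  Protected key.s = LCase(word)
  ; Accessing a key that does not exist yet creates it with an empty value,
  ; so no separate duplicate check is needed.
  concord(key) = concord(key) + " " + ref + ","
EndProcedure

AddWord(concord(), "Libro", "1:1")
AddWord(concord(), "libro", "1:2")   ; lands in the same entry as "Libro"
AddWord(concord(), "kaj", "1:2")

Debug MapSize(concord())             ; number of distinct words

ForEach concord()
  Debug MapKey(concord()) + concord()
Next
```

A map is not kept in alphabetical order, so for the final listing the keys would still have to be copied into an array or list and sorted. Whether LCase() folds the accented capitals (Ĉ, Ĝ, Ŝ and so on) as expected in a Unicode executable is worth checking with a quick test.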
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 6:37 pm
by NormJ
Gentlemen,
Thank you all for your comments. Demivec and Little John, I am considering your corrections, and will probably use them. Thanks.
idle, I do not know how to use a Map, but I will read up on it, and maybe that is the best way to go.
Is there other documentation available for PureBasic?
I have kale's book, as well as Krylar's book, and the official reference for 5.30, but none of them go into as much detail as I would like and am used to, from using a lot of the MS documentation for VB. 8-\
Norm
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 7:29 pm
by Little John
NormJ, you are welcome!
I hope you can make your code work like you want. If not, please don't hesitate to ask here again for help.
If you want to do serious programming, IMHO it is sooner or later recommended to get one or more good book(s) about algorithms and data structures, e.g. CLRS. However, I don't know whether something like that is actually what you're looking for.
Oh, by the way, here is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Re: Difference between ASCII and Unicode
Posted: Mon Feb 09, 2015 8:16 pm
by Joris
Apologies for hijacking this topic a bit, but I'm (also) quite a noob on this ...
* Are there rules (or tools) to check whether a source needs Unicode or not?
* Will the use of FindString, StringField, ExtractRegularExpression, etc. in my sources be affected when they are used with Unicode files (they already work fine with ASCII only)?
* If I set the options below, will it make a difference even if no Unicode is in use?
In any case, for properly handling Unicode text, at least 2 settings should be made in the PureBasic IDE
Compiler > Compiler Options ... > [v] Create Unicode executable
File > File format > Encoding: UTF-8
Thanks.