Difference between ASCII and Unicode

Re: Difference between ASCII and Unicode

Post by Little John »

I'll try to give some answers.
Joris wrote: Are there rules (or tools) to check whether a source needs Unicode or not?
Unicode is only about strings, not about numbers, and not about music (as someone wrote here recently :D ).

All kinds of strings are affected: string constants, string variables, strings read from or written to a text file.
If all your strings contain only ASCII characters, then you don't need Unicode. But many languages have so-called "special characters", i.e. characters outside the ASCII range. To handle those characters correctly, your program probably needs to be compiled in Unicode mode (in the end, it all depends on what exactly your program does).
Support for ASCII compilation with PureBasic will be dropped in the foreseeable future, so sooner or later all PB programs compiled with up-to-date PB versions will be Unicode programs. For these reasons, Unicode is the present and the future. ASCII is a technology from the last century. :-)
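As a rough check of whether a source or text file contains anything beyond ASCII, one could scan its bytes for values above 127. The following is only a minimal sketch: IsPureAscii() is a hypothetical helper name, not a built-in PureBasic function, and the check works on raw bytes without interpreting the encoding.

```purebasic
; Sketch: report whether a file contains any byte outside the ASCII range.
; IsPureAscii() is a hypothetical helper, not part of PureBasic itself.
Procedure.i IsPureAscii(fileName.s)
  Protected file.i, result.i = #True
  file = ReadFile(#PB_Any, fileName)
  If file = 0
    ProcedureReturn -1                    ; the file could not be opened
  EndIf
  While Not Eof(file)
    If ReadAsciiCharacter(file) > 127     ; byte values above 127 are not ASCII
      result = #False
      Break
    EndIf
  Wend
  CloseFile(file)
  ProcedureReturn result
EndProcedure

Define name.s = OpenFileRequester("Select a source or text file:", "", "All files|*.*", 0)
If name
  Select IsPureAscii(name)
    Case #True  : Debug "Only ASCII bytes found."
    Case #False : Debug "Non-ASCII bytes found."
    Default     : Debug "Could not open the file."
  EndSelect
EndIf
```

Note that a file saved with a UTF-8 byte order mark will be reported as non-ASCII even if its text is plain ASCII, because the mark itself consists of bytes above 127.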
Joris wrote: Will the use of FindString, StringField or ExtractRegularExpression etc. in my sources have any influence when they are used with Unicode files (already working fine with only ASCII)?
Read the help for the PB functions you are interested in. It should be documented there whether they work differently in ASCII or Unicode mode. Also check whether a function has an optional format parameter that accepts #PB_Ascii, #PB_Unicode, #PB_UTF8, ...
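To illustrate what such a format parameter does, here is a small sketch using StringByteLength(), which takes an optional format flag. It assumes the program is compiled in Unicode mode; the sample word is taken from the Esperanto text discussed in this thread and is built with Chr() so that the result does not depend on how the source file itself is saved.

```purebasic
; Sketch: the same string needs a different number of bytes per format.
; Assumes a Unicode executable; Chr($011D) is the Esperanto letter g-circumflex.
text.s = "re" + Chr($011D) + "o"

Debug Len(text)                            ; number of characters
Debug StringByteLength(text, #PB_Ascii)    ; bytes needed in ASCII format
Debug StringByteLength(text, #PB_UTF8)     ; bytes needed in UTF-8 format
Debug StringByteLength(text, #PB_Unicode)  ; bytes needed in UTF-16 format
```

The UTF-8 byte count comes out larger than the character count, because the accented letter takes two bytes in UTF-8. That is the kind of difference to look out for whenever a function reads, writes or measures strings in a specific format.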
Joris wrote: If setting these below, will it make a difference too if no Unicode is in use?
In any case, to handle Unicode text properly, at least two settings should be made in the PureBasic IDE:
Compiler > Compiler Options ... > [v] Create Unicode executable
File > File format > Encoding: UTF-8
As far as I can see, File > File format > Encoding: UTF-8 will not make a difference if no Unicode is in use.
In other words, this PB source file format is safe for both ASCII and Unicode mode, and I'd recommend using it for all new PB programs.
With existing PB programs that are saved as "plain text" files, it's a bit different: if you just switch the file format setting in the PB IDE from "plain text" to "UTF-8", you may end up with some unreadable characters. For the conversion, it's better to use a good text editor that has a command such as "Save as UTF-8".
Compiler > Compiler Options ... > [v] Create Unicode executable might or might not make your program work differently; it depends on what the program does.

Re: Difference between ASCII and Unicode

Post by TI-994A »

NormJ wrote:...this is the output I'm trying to get...:

Abija Mat 1:7(2),
Abiud Mat 1:13(2),
Abraham Mat 1:1, 1:2, 1:17,
adorklinig^i Mat 2:2, 2:8,
adorklinig^is Mat 2:11,
Ahaz Mat 1:9(2),
Ah^im Mat 1:14(2),
Akridoj Mat. 3.4,
al Mat 1:18, 1:20, 2:4, 2:5, 2:8, 2:11, 2:12(3), 2:13(2)...
Hello Norm. Here's an example demonstrating your requirements. It utilises both lists and arrays, and illustrates the fundamentals of file processing with file type considerations:

Code:

Enumeration 
  #MainWindow
  #FileName
  #FileType
  #Chapter
  #IndexView
  #OpenSelector
  #SaveSelector
EndEnumeration

Global fileFormat.i
Global Dim concordance.s(1)
Global Dim letterIndex.s(1)

Procedure CreateIndexButtons()
  x = 10 : y = 135
  ButtonGadget(99, 10, 105, 80, 20, "Full Index")
  For buttons = 0 To ArraySize(letterIndex())
    ButtonGadget(buttons + 100, x, y, 20, 20, letterIndex(buttons))
    x + 30
    If x > 70 
      x = 10
      y + 30
    EndIf
  Next buttons  
EndProcedure

Procedure RemoveIndexButtons()
  For buttons = 0 To ArraySize(letterIndex()) + 1
    If IsGadget(buttons + 99)
      FreeGadget(buttons + 99)
    EndIf
  Next buttons  
  ClearGadgetItems(#IndexView)
EndProcedure

Procedure displayFilteredIndex(button)
  ClearGadgetItems(#IndexView)
  For display = 0 To ArraySize(concordance())
    If Left(concordance(display), 1) = GetGadgetText(button)
      AddGadgetItem(#IndexView, -1, concordance(display))
    EndIf
  Next display 
EndProcedure

Procedure displayFullIndex()
  ClearGadgetItems(#IndexView)
  For display = 0 To ArraySize(concordance())
    AddGadgetItem(#IndexView, -1, concordance(display))
  Next display  
EndProcedure

Procedure SaveFile(outputFile.s)
  If Right(outputFile, 4) <> ".txt"
    outputFile + ".txt"
  EndIf
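  fileNum = CreateFile(#PB_Any, outputFile)
  If fileNum = 0
    ; CreateFile() returns zero if the file cannot be created
    MessageRequester("Concordance Builder:", "Could not create " + GetFilePart(outputFile))
    ProcedureReturn
  EndIf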
  WriteStringFormat(fileNum, fileFormat)
  For writeloop = 0 To ArraySize(concordance())
    WriteStringN(fileNum, concordance(writeloop), fileFormat)
  Next writeloop
  CloseFile(fileNum)
  DisableGadget(#SaveSelector, 1)
  MessageRequester("Concordance Builder:", "Index saved to " + GetFilePart(outputFile))
EndProcedure

Procedure ProcessFile(inputFile.s)
  NewList wordIndex.s()
  Dim verses.s(1)
  fileNum = ReadFile(#PB_Any, inputFile)
  If fileNum
    fileFormat = ReadStringFormat(fileNum)
    If fileFormat = #PB_Ascii Or fileFormat = #PB_UTF8
      If fileFormat = #PB_Ascii
        fileFormat$ = "Plain ASCII File"
      Else
        fileFormat$ = "UTF8 Unicode File"
      EndIf
      SetGadgetText(#FileName, "  Filename: " + GetFilePart(inputFile))
      SetGadgetText(#FileType, "  File type: " + fileFormat$)
      While Not Eof(fileNum)
        ReDim verses(idx)
        verses(idx) = Trim(ReadString(fileNum, #PB_UTF8))
        idx + 1
      Wend
      CloseFile(fileNum)  
    Else
      MessageRequester("Concordance Builder:", "Unsupported file format")
      CloseFile(fileNum)
      ProcedureReturn 
    EndIf
    RemoveIndexButtons()
    delimiterList$ = ",.;:+/!|?()[]'" + Chr(34)
    For verse = 0 To ArraySize(verses())
      verseSplitter = FindString(verses(verse), Space(1), 
                      FindString(verses(verse), Space(1)) + 1)
      verseNum$ = " " + Left(verses(verse), verseSplitter - 1)
      versePrefix$ =  Trim(Left(verseNum$, 4))
      verseContent$ = Mid(verses(verse), verseSplitter)
      For stripDelimiters = 1 To Len(delimiterList$)
        verseContent$ = ReplaceString(verseContent$, 
                        Mid(delimiterList$, stripDelimiters, 1), Space(1))
      Next stripDelimiters
      For stripExtraSpaces = 1 To Len(verseContent$)
        verseContent$ = Trim(ReplaceString(verseContent$, Space(2), 
                        Space(1), #PB_String_NoCase, stripExtraSpaces))
      Next stripExtraSpaces
      For word = 1 To CountString(verseContent$, Space(1)) + 1
        word$ = StringField(verseContent$, word, Space(1)) + verseNum$
        word$ = UCase(Left(word$, 1)) + Mid(word$, 2)
        ResetList(wordIndex())
        While NextElement(wordIndex())
          If Left(wordIndex(), Len(word$)) = word$
            duplicates = Val(Mid(wordIndex(), (FindString(wordIndex(), "(") + 1), 
                         (FindString(wordIndex(), ")") - (FindString(wordIndex(), "(") + 1)))) + 1
            If duplicates = 1 : duplicates = 2 : EndIf
            duplicate$ = "(" + Str(duplicates) + ")"
            wordIndex() = Left(wordIndex(), Len(word$)) + duplicate$ 
          EndIf
        Wend
        If duplicates
          duplicates = 0
        Else
          LastElement(wordIndex())
          AddElement(wordIndex()) : wordIndex() = word$
        EndIf
      Next word
    Next verse
    SortList(wordIndex(), #PB_Sort_Ascending | #PB_Sort_NoCase)
    ResetList(wordIndex())
    While NextElement(wordIndex())
      If index > 0 
        If Left(wordIndex(), 1) <> Left(concordance(index - 1), 1)
          ReDim letterIndex(index2)
          letterIndex(index2) = Left(wordIndex(), 1)
          index2 + 1
        EndIf
        If Trim(Left(wordIndex(), FindString(wordIndex(), versePrefix$) - 1)) = "" +
           Trim(Left(concordance(index - 1), FindString(concordance(index - 1), versePrefix$) - 1))
          concordance(index - 1) + ", " + Mid(wordIndex(), FindString(wordIndex(), versePrefix$))
        Else
          ReDim concordance(index)
          concordance(index) = wordIndex()
          index + 1
        EndIf
      Else
        concordance(index) = wordIndex()
        index + 1
        letterIndex(index2) = Left(wordIndex(), 1)
        index2 + 1
      EndIf
    Wend
    For display = 0 To ArraySize(concordance())
      AddGadgetItem(#IndexView, -1, concordance(display))
    Next display
    CreateIndexButtons()
    DisableGadget(#SaveSelector, 0)  
    SetGadgetText(#Chapter, "  Chapter prefix: " + StringField(verses(0), 1, Space(1)))
  EndIf  
EndProcedure

wFlags = #PB_Window_SystemMenu | #PB_Window_ScreenCentered
OpenWindow(#MainWindow, #PB_Any, #PB_Any, 360, 460, "Concordance Builder", wFlags)
TextGadget(#FileName, 10, 10, 340, 25, "")
TextGadget(#FileType, 10, 40, 165, 25, "")
TextGadget(#Chapter, 185, 40, 165, 25, "")
ListViewGadget(#IndexView, 100, 105, 250, 310)
ButtonGadget(#OpenSelector, 10, 70, 340, 25, "SELECT FILE")
ButtonGadget(#SaveSelector, 10, 425, 340, 25, "SAVE FILE")
SetGadgetColor(#FileName, #PB_Gadget_BackColor, RGB(200, 200, 200))
SetGadgetColor(#FileType, #PB_Gadget_BackColor, RGB(200, 200, 200))
SetGadgetColor(#Chapter, #PB_Gadget_BackColor, RGB(200, 200, 200))
SendMessage_(GadgetID(#IndexView), #LB_SETHORIZONTALEXTENT, 1500, 0)
DisableGadget(#SaveSelector, 1)

Repeat
  Select WaitWindowEvent()
    Case #PB_Event_CloseWindow
      appQuit = 1
    Case #PB_Event_Gadget
      Select EventGadget()
        Case #OpenSelector
          fileType$ = "Text Files|*.txt;*.bat;"
          inputFile$ = OpenFileRequester("Please select input file:", "", fileType$, 0)
          If inputFile$
            processFile(inputFile$)
          EndIf
        Case #SaveSelector
          fileType$ = "Text Files|*.txt;*.bat;"
          outputFile$ = SaveFileRequester("Please select output file:", "", fileType$, 0)
          If outputFile$
            SaveFile(outputFile$)
          EndIf          
        Case 99
          displayFullIndex()
        Case 100 To ArraySize(letterIndex()) + 100
          displayFilteredIndex(EventGadget())
      EndSelect
  EndSelect
Until appQuit = 1
Please ensure that it is compiled with the UNICODE switch [Compiler > Compiler Options > Create unicode executable], and that your test file is saved in the UTF-8 format, in this layout:

Code:

Mat 1:1 La libro de la genealogio de Jesuo Kristo, filo de David, filo de Abraham.
Mat 1:2 Al Abraham naskigis Isaak, kaj al Isaak naskigis Jakob, kaj al Jakob naskigis Jehuda kaj liaj fratoj,
Mat 1:3 kaj al Jehuda naskigis Perec kaj Zeraĥ el Tamar, kaj al Perec naskigis Ĥecron, kaj al Ĥecron naskigis Ram,
Mat 1:4 kaj al Ram naskigis Aminadab, kaj al Aminadab naskigis Naĥŝon, kaj al Naĥŝon naskigis Salma,
Mat 1:5 kaj al Salma naskigis Boaz el Raĥab, kaj al Boaz naskigis Obed el Rut, kaj al Obed naskigis Jiŝaj,
Mat 1:6 kaj al Jiŝaj naskigis David, la reĝo. Kaj al David la reĝo naskigis Salomono el [la] [edzino] de Urija,
Mat 1:7 kaj al Salomono naskigis Reĥabeam, kaj al Reĥabeam naskigis Abija, kaj al Abija naskigis Asa,
...
...
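If no editor with a "Save as UTF-8" command is at hand, a test file in that format could also be written from PureBasic itself. This is only a sketch under a few assumptions: the file name "test_utf8.txt" and its location in the temporary directory are placeholders, the program is compiled as a Unicode executable, and the accented letters are built with Chr() so that the encoding of the source file does not matter.

```purebasic
; Sketch: write a one-line UTF-8 test file with a byte order mark, then read it back.
; "test_utf8.txt" is a placeholder name; compile as a Unicode executable.
fileName.s = GetTemporaryDirectory() + "test_utf8.txt"

file = CreateFile(#PB_Any, fileName)
If file
  WriteStringFormat(file, #PB_UTF8)   ; writes the UTF-8 byte order mark
  ; Chr($015D) and Chr($011D) are the Esperanto letters s-circumflex and g-circumflex
  line.s = "Mat 1:6 kaj al Ji" + Chr($015D) + "aj naskigis David, la re" + Chr($011D) + "o."
  WriteStringN(file, line, #PB_UTF8)
  CloseFile(file)
EndIf

file = ReadFile(#PB_Any, fileName)
If file
  If ReadStringFormat(file) = #PB_UTF8   ; detects the byte order mark
    Debug "UTF-8 byte order mark found"
  EndIf
  Debug ReadString(file, #PB_UTF8)
  CloseFile(file)
EndIf
```

The read-back part mirrors what ProcessFile() does in the program above: ReadStringFormat() inspects the byte order mark, and the lines are then read as UTF-8.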
Hope you find it helpful. :)

Re: Difference between ASCII and Unicode

Post by NormJ »

TI-994A, many thanks!

Wow! It took you what, about a day, to do that! I've been working on mine since the middle of last month. I started out in VB6, then tried to do it in VB2005, then finally started looking for something in an open-source vein, which is when I found Pure. So what I have done so far is more or less a direct translation from a Microsoft-type Basic into PureBasic. What I'm doing now is learning all the intricacies of PureBasic (at least, I'm trying to), and I will be studying hard what you just sent me.

Another question for the forum. In my reading so far, I haven't found any way to print out a listing of a program. I'm sure it's very easy, but I haven't found a print command in the IDE.

Norm

Re: Difference between ASCII and Unicode

Post by Joris »

Yeah, thanks guys, Little John!
NormJ wrote: Another question for the forum. In my reading so far, I haven't found any way to print out a listing of a program. I'm sure it's very easy, but I haven't found a print command in the IDE.
Once you've saved your source, you can at least open it in any text editor and print it from there (how to print from within PB, I don't know yet... :oops:).
Last edited by Joris on Wed Feb 11, 2015 10:26 am, edited 1 time in total.

Re: Difference between ASCII and Unicode

Post by heartbone »

Little John wrote: I'll try to give some answers.
{snip} For these reasons, Unicode is the present and the future. ASCII is a technology from the last century. :-)
So is hardcopy. :D