PB5.50; Reading PB-Source; UTF-8; write UTF-8;

HanPBF · Post by **HanPBF** » Fri Sep 30, 2016 8:21 am

Hello,

I am trying to read PB source line by line, inline every file referenced by
includeFile "..."
and write the result out to another file.

I am struggling a little bit with Unicode vs. UTF-8.

While strings in PB are held in memory as Unicode, the source files are stored in UTF-8.

When I read line by line and write line by line, the written source can no longer be compiled because of syntax errors.
The errors seem to come from non-UTF-8 characters, because the compiler complains about empty lines.

Calling WriteStringFormat at the beginning of the generated file does not help either.

Any idea how I can read the source and write it back so that the characters are preserved?

Thanks in advance!

Btw I have this source:

Code: Select all

macro AddListElement(PList, PElement)
  AddElement(PList)
  PList = PElement
endMacro

procedure joinSourceFiles(PathToMainFile.s, PathToJoinedFile.s)
    protected i   .i
    protected newlist StackOfLines.s()
    protected newList StackOfPaths.s()
    protected newlist StackOfFiles.i()
    protected Path    .s
    protected Pos1    .i
    protected Pos2    .i
    protected F       .s                      ; current file
    protected MainPath  .s
    protected NrF     .i                      ; nr of main file
    protected NrJ     .i                      ; nr of joined file
    protected Line    .s
    
    protected IDEOptionsText    .s
    protected IncFileText       .s
    protected DQ                .s
    
    IDEOptionsText  = "; ide options = "
    IncFileText     = "includefile"
    DQ              = Chr(34)
    
    
    F = PathToMainFile
    MainPath = GetPathPart(F)
    
    ;i = 1
    
    ;AddListElement(StackOfLines(), i)
    
    NrJ = CreateFile(#PB_Any, PathToJoinedFile, #PB_UTF8)
    
    if NrJ
      ;WriteStringFormat(NrJ, #PB_Unicode); doesn't work
      
      if FileSize(PathToMainFile) >= 0 ; FileSize() returns -1 for a missing file, which would otherwise count as true
        NrF = ReadFile(#PB_Any, F, #PB_File_SharedRead | #PB_UTF8)
        
        if NrF
          while eof(NrF)=0
            Line = ReadString(NrF)
            
            if sourceLineStartsWith(Line, IncFileText)
              AddListElement(StackOfPaths(), MainPath)
              AddListElement(StackOfFiles(), NrF)
              
              Pos1 = FindString(Line, DQ)
              Pos2 = FindString(Line, DQ, Pos1 + 1)
              
              Path = mid(Line, Pos1 + 1, Pos2 - Pos1 - 1)
              
              debug Path
              
              if mid(Path, 2, 1)=":"
                NrF = ReadFile(#PB_Any, Path, #PB_File_SharedRead | #PB_UTF8)
                
              else              
                NrF = ReadFile(#PB_Any, MainPath + Path, #PB_File_SharedRead | #PB_UTF8)
                MainPath + GetPathPart(Path)
              endif 
              
              if EOF(NrF)
                CloseFile(NrF)
                
                NrF = StackOfFiles()
                DeleteElement(StackOfFiles())
                
                MainPath = StackOfPaths()
                DeleteElement(StackOfPaths())
                
              else
                Line = ReadString(NrF)  
                WriteStringN(NrJ, Line, #PB_UTF8)       
              endif
              
            elseif sourceLineStartsWith(Line, IDEOptionsText)
              if ListSize(StackOfFiles())>0
                CloseFile(NrF)
                
                NrF = StackOfFiles()
                DeleteElement(StackOfFiles())
              
                MainPath = StackOfPaths()
                DeleteElement(StackOfPaths())
                
                ;Line = StackOfLines()
                ;DeleteElement(StackOfLines())
              
                
              else
                break
              endif
            else
              WriteStringN(NrJ, Line, #PB_UTF8)
            endif
          wend
          
          CloseFile(NrF)
        endif
      endif
      CloseFile(NrJ) ; close only when CreateFile() succeeded
    endif
EndProcedure ; joinSourceFiles(PathToMainFile.s, PathToJoinedFile.s)

kenmo · Post by **kenmo** » Fri Sep 30, 2016 12:30 pm

It should be as simple as using

Code: Select all

WriteStringFormat(NrJ, #PB_UTF8)

instead of

Code: Select all

WriteStringFormat(NrJ, #PB_Unicode)

"Unicode" there declares that your file is in a 16-bit character format, but it is not: you are writing UTF-8 strings.

EDIT: You may also want to use ReadStringFormat() at the beginning of the input file, otherwise a UTF-8 BOM might pass through your code as some extra unwanted characters.
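
The two suggestions above can be combined roughly as follows. This is only a minimal sketch, not the original poster's final code: `CopyAsUTF8` is a hypothetical helper name, and it assumes that a file without a BOM may be read as ASCII, which is what ReadStringFormat() reports in that case. If you know your BOM-less files are really UTF-8, pass #PB_UTF8 to ReadString() instead.

```purebasic
; Sketch: copy a text file line by line, honoring a BOM on input
; and writing UTF-8 with a BOM on output.
; CopyAsUTF8 is a hypothetical helper, not part of PureBasic.
Procedure.i CopyAsUTF8(InPath.s, OutPath.s)
  Protected InFile.i, OutFile.i, Format.i, Line.s

  InFile = ReadFile(#PB_Any, InPath, #PB_File_SharedRead)
  If InFile = 0
    ProcedureReturn #False
  EndIf

  OutFile = CreateFile(#PB_Any, OutPath)
  If OutFile = 0
    CloseFile(InFile)
    ProcedureReturn #False
  EndIf

  ; Skips the BOM if there is one and reports the detected format.
  ; Without a BOM the result is #PB_Ascii.
  Format = ReadStringFormat(InFile)

  ; Assumption: ReadString() is only given ASCII, UTF-8 or UTF-16;
  ; anything else falls back to UTF-8 here.
  If Format <> #PB_Ascii And Format <> #PB_UTF8 And Format <> #PB_Unicode
    Format = #PB_UTF8
  EndIf

  WriteStringFormat(OutFile, #PB_UTF8) ; writes the UTF-8 BOM

  While Eof(InFile) = 0
    Line = ReadString(InFile, Format)
    WriteStringN(OutFile, Line, #PB_UTF8)
  Wend

  CloseFile(InFile)
  CloseFile(OutFile)
  ProcedureReturn #True
EndProcedure
```

In a joiner like the one in the first post, ReadStringFormat() would be called once right after each ReadFile(), and the returned format kept alongside the file number so that every ReadString() on that file uses it.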

HanPBF · Post by **HanPBF** » Fri Sep 30, 2016 1:14 pm

As you said: the solution was to call ReadStringFormat for each file, because some were in ASCII and some in UTF-8.
Thanks a lot!

I also read this: http://www.joelonsoftware.com/articles/Unicode.html (which I obviously needed...)

As far as I learned from this article:

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

The argument against XML has always been that it wastes too much space; the counterargument was that XML is compressed in transit.
Shouldn't the same work for Unicode, so that it could simply use 4 bytes per character uncompressed?
How would I do this to prevent any misunderstanding?
Or is UTF-8 the best solution?

kenmo · Post by **kenmo** » Fri Sep 30, 2016 1:45 pm

I have been researching encodings and Unicode a lot over the last few years...

Your link is a good read, shared often, but also a bit outdated.
The Unicode maximum is now $10FFFF, which requires 21 bits, and there are several ways to encode it.
Wikipedia:

Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF. Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number.
For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X);
for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).

https://en.wikipedia.org/wiki/Unicode

By "4 bytes unpacked" you probably mean a fixed-width 32-bit encoding like UTF-32:
https://en.wikipedia.org/wiki/UTF-32
It has some advantages but is not very common.
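
To make the size trade-off concrete, here is a small sketch that computes how many bytes one code point needs in UTF-8 and UTF-16 (UTF-32 is always 4). `UTF8Bytes` and `UTF16Bytes` are hypothetical helper names made up for this illustration; the ranges are the standard ones for each encoding.

```purebasic
; Hypothetical helper: bytes needed to encode one code point in UTF-8
Procedure.i UTF8Bytes(CodePoint.i)
  If CodePoint <= $7F
    ProcedureReturn 1
  ElseIf CodePoint <= $7FF
    ProcedureReturn 2
  ElseIf CodePoint <= $FFFF
    ProcedureReturn 3
  EndIf
  ProcedureReturn 4 ; $10000 to $10FFFF
EndProcedure

; Hypothetical helper: bytes needed to encode one code point in UTF-16
Procedure.i UTF16Bytes(CodePoint.i)
  If CodePoint <= $FFFF
    ProcedureReturn 2
  EndIf
  ProcedureReturn 4 ; surrogate pair
EndProcedure

Debug UTF8Bytes($58)     ; 1 (LATIN CAPITAL LETTER X)
Debug UTF8Bytes($20AC)   ; 3 (EURO SIGN)
Debug UTF8Bytes($1F3C3)  ; 4 (RUNNER)
Debug UTF16Bytes($58)    ; 2
Debug UTF16Bytes($1F3C3) ; 4
```

For text that is mostly ASCII, such as source code, UTF-8 therefore needs about a quarter of the space of UTF-32 before any compression.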

Personally I am a UTF-8 fan, I think it has the overall best pros-vs-cons. Most libraries seem to use it (examples: SDL, Scintilla). I use it for file/program/network IO, and I would use it for in-memory strings if PB ever added that mode.

http://utf8everywhere.org/

One more thing. You might think PB cannot handle characters > $FFFF because of its 16-bit character size.
But you can encode them as two 16-bit values called a "surrogate pair". For example, here is the RUNNER emoji character $1F3C3:

Code: Select all

CompilerIf (Not #PB_Compiler_Unicode)
  CompilerError "Compile in Unicode mode"
CompilerEndIf

Debug "Character $1F3C3 is stored in UTF-16 as $D83C + $DFC3"
high = $D83C
low  = $DFC3
str.s = Chr(high) + Chr(low)

Debug "PB says length is: " + Str(Len(str))
Debug "String displays as: " + str + " (if your font supports it)"

https://en.wikipedia.org/wiki/UTF-16
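
The two values in the example above are not arbitrary. They follow from the UTF-16 rule: subtract $10000, then put the upper 10 bits on top of $D800 and the lower 10 bits on top of $DC00. A minimal sketch, where `ToSurrogates` is a hypothetical helper name and the input is assumed to be a valid code point up to $10FFFF:

```purebasic
; Hypothetical helper: build a PB string from a single code point,
; using a surrogate pair when the code point is above $FFFF.
Procedure.s ToSurrogates(CodePoint.i)
  Protected v.i

  If CodePoint < $10000
    ProcedureReturn Chr(CodePoint) ; fits in one 16-bit unit
  EndIf

  v = CodePoint - $10000           ; 20-bit value
  ProcedureReturn Chr($D800 + (v >> 10)) + Chr($DC00 + (v & $3FF))
EndProcedure

s.s = ToSurrogates($1F3C3)
Debug Hex(Asc(Mid(s, 1, 1))) ; D83C
Debug Hex(Asc(Mid(s, 2, 1))) ; DFC3
```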

HanPBF · Post by **HanPBF** » Fri Sep 30, 2016 7:51 pm

Great explanations and links!

By "4 bytes" I meant storing each letter in 4 bytes and letting zip do the work of reducing the size. OK, but I understand there is some history behind those character encodings.

But since you have researched this a lot, I will take this as good advice:

Personally I am a UTF-8 fan, I think it has the overall best pros-vs-cons

Thanks a lot!
