Weird bytes $3F in the middle of chars 0-255

Post by oakvalley »

Hi there,

Try this code in PB5.60:

Code: Select all

For t=0 To 255
    string$=string$+RSet(Str(t),3,"0")+"|"
Next t

CreateFile(2,"d:\wierd.txt",#PB_Ascii)

For t=0 To 255
    WriteString(2,Chr(Val(Left(StringField(string$,t+1,"|"),3))),#PB_Ascii)
Next t
CloseFile(2)

Take a look at the file in a hex editor. It counts as expected from $01 to $7F, then suddenly $80 becomes $3F, then $81 is where I would expect it, then there are lots of $3F and other weird bytes, until it reaches $A0 and counts fine again up to $FF.

How come?

I already specified #PB_Ascii for both CreateFile and WriteString. Is there suddenly a magic ASCII range from 128-160?
Regards Stone Oakvalley
Currently @ PB 5.70

Post by kenmo »

ASCII and Unicode definitions do not match in the $80-$9F (128-159) range.

The Unicode characters in this range are:
https://en.wikipedia.org/wiki/Latin-1_S ... ode_block)

The "ASCII" characters in this range are, assuming you're on Windows:
https://en.wikipedia.org/wiki/Windows-1252


Strings in PB 5.60 are always Unicode.
#PB_Ascii just tells it to convert to ASCII, as best as possible.
But the Unicode characters $80-$9F don't exist in ASCII, so they become $3F (question mark).

Example: Why doesn't Unicode $80 (Padding Character) just convert to ASCII $80?
Because ASCII $80 is the Euro sign... which actually pairs with Unicode $20AC.

So unmappable Unicode characters (including most > 255) become '?' instead of becoming other different characters.

I believe PB is using the OS's built-in conversion, so if you disagree with the behavior you probably have to write your own procedure.


Hope that makes sense.
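
To see exactly which values are affected on a given machine, a small round-trip test can help. The sketch below writes each code point from $80 to $9F with WriteString(..., #PB_Ascii), reads the file back byte by byte, and reports every value that changed. The temporary file name is a placeholder, and the result depends on the system code page, so no output is shown.

```purebasic
; Sketch: find which code points in $80-$9F change when written as #PB_Ascii.
; The file name is a placeholder; results depend on the OS code page.
file$ = GetTemporaryDirectory() + "pb_ascii_roundtrip.bin"

If CreateFile(0, file$)
  For code = $80 To $9F
    WriteString(0, Chr(code), #PB_Ascii)   ; converted from Unicode on the way out
  Next
  CloseFile(0)
EndIf

If ReadFile(0, file$)
  For code = $80 To $9F
    value = ReadAsciiCharacter(0)          ; raw byte, no conversion
    If value <> code
      Debug "U+" + RSet(Hex(code), 4, "0") + " was written as $" + RSet(Hex(value), 2, "0")
    EndIf
  Next
  CloseFile(0)
  DeleteFile(file$)
EndIf
```

The comparison assumes a single-byte code page, where each character becomes exactly one byte in the file.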

Post by kenmo »

Same results using Windows API:

Code: Select all

DataSection
  UnicodeBytes:
  Data.u $0041, $0080, $0081 ; Unicode chars $0041, $0080, $0081
  
  AsciiBytes:
  Data.a 0, 0, 0 ; ASCII byte buffer
EndDataSection

Debug Hex(PeekU(?UnicodeBytes + 0))
Debug Hex(PeekU(?UnicodeBytes + 2))
Debug Hex(PeekU(?UnicodeBytes + 4))
Debug ""

; Convert to ASCII using Windows API
WideCharToMultiByte_(#CP_ACP, #Null, ?UnicodeBytes, 3, ?AsciiBytes, 3, #Null, #Null)

Debug Hex(PeekA(?AsciiBytes + 0))
Debug Hex(PeekA(?AsciiBytes + 1))
Debug Hex(PeekA(?AsciiBytes + 2))

Post by firace »

@kenmo Good info, thanks!

And if you just want to write raw values 0 to 255, this should work as expected:

Code: Select all

CreateFile(0,"temp.DAT")

For t=0 To 255
  WriteAsciiCharacter(0, t)
Next t

CloseFile(0)

Post by oakvalley »

@kenmo Thanks for the clarification!

So, if I understand this correctly: I thought ASCII was 0-255 and Unicode was everything above 255, up to however many millions of characters they need to create later.

How they ended up eating a HOLE into standard ASCII 0-255 and basically saying "let's do a best fit here instead" is completely ludicrous. The guys who invented Unicode sure messed up, as far as I can see.

Why couldn't they just leave 0-255 as it always was and continue from there with Unicode (simply ADD it to the already established table)?

Oh well, I'll have to make my own conversion for those Unicode bytes and write them out myself, in whatever code/memory/file situation I run into, as the characters I need to have :-)
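
A hand-rolled conversion of this kind could simply write the low 8 bits of each character code, which treats code points 0-255 as raw byte values and bypasses the code page entirely. This is only a minimal sketch: WriteRawString is a hypothetical helper name, the file path is a placeholder, and replacing characters above 255 with '?' is an arbitrary choice.

```purebasic
; Hypothetical helper: write each character of text$ as one raw byte.
Procedure WriteRawString(file, text$)
  Protected i, code
  For i = 1 To Len(text$)
    code = Asc(Mid(text$, i, 1))   ; Unicode code point of this character
    If code > 255
      code = '?'                   ; no single-byte equivalent, arbitrary fallback
    EndIf
    WriteAsciiCharacter(file, code)
  Next
EndProcedure

If CreateFile(0, GetTemporaryDirectory() + "raw_string.bin")   ; placeholder path
  WriteRawString(0, "A" + Chr($80) + Chr($9F) + Chr($FF))      ; bytes 41 80 9F FF
  CloseFile(0)
EndIf
```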

Post by Josh »

ASCII covers only 0-127, not 0-255.
With 8-bit character sets, the 128-255 range differs between languages and applications and depends on the code page in use.

The problems you have are of your own making. See your other topic.
sorry for my bad english

Post by kenmo »

"There are over a hundred encodings and above code point 127, all bets are off."
This article is 14 years old (!) but still a good read about ASCII, codepages, and Unicode.

https://www.joelonsoftware.com/2003/10/ ... o-excuses/

True "ASCII" is just 0-127. Characters 128-255 were vendor-specific and unreliable, UNTIL Unicode came along and created a global standard.
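
Since only 0-127 is safe across code pages, one defensive option is to check a string before writing it in a single-byte format. A small sketch of that check, with IsPlainAscii as a hypothetical helper name:

```purebasic
; Hypothetical helper: returns #True if every character of text$ is in 0-127.
Procedure.i IsPlainAscii(text$)
  Protected i
  For i = 1 To Len(text$)
    If Asc(Mid(text$, i, 1)) > 127
      ProcedureReturn #False       ; found a character outside 7-bit ASCII
    EndIf
  Next
  ProcedureReturn #True
EndProcedure

Debug IsPlainAscii("Hello")            ; 1
Debug IsPlainAscii("Caf" + Chr(233))   ; 0, Chr(233) is outside 0-127
```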

Post by VB6_to_PBx »

firace wrote: @kenmo Good info, thanks!

And if you just want to write raw values 0 to 255, this should work as expected:

Code: Select all

CreateFile(0,"temp.DAT")

For t=0 To 255
  WriteAsciiCharacter(0, t)
Next t

CloseFile(0)

I tweaked your code a little so that it writes each character on its own line, followed by its number:

Code: Select all

;-       WriteAsciiCharacter__v1.pb
;-
;-       Link : http://www.purebasic.fr/english/viewtopic.php?f=13&t=69023
;-       Post Subject/Date : 
;-       Compiler : PB 5.31
;-      
;-< Start Program >------------------------------------------------------------
;
;
CreateFile(0,"C:\PureBASIC\ASCII__and__UniCode__and__UTF_8\WriteAsciiCharacter.txt")  ;<-- type your Drive/Folder/Filename here
For t = 0 To 255
  WriteAsciiCharacter(0, t)
  ; WriteString(0, Space(3) + Str(t) + #CRLF$)
  WriteString(0, " = " + Str(t) + #CRLF$)
Next t
CloseFile(0)
 
PureBasic .... making tiny electrons do what you want !

"With every mistake we must surely be learning" - George Harrison

Post by oakvalley »

Yeah, you are all right. ASCII is 0-127, 7-bit. It's just that my memories of the good old Commodore 64 and Amiga steered me into believing ASCII was 0-255 :-)

Anyway, I solved my problems by using *ascii=Ascii() and then PokeA(*ascii) in PureBasic to get the values I was seeking. I was working with some databases that contained filenames originating from Amiga and just got surprised when there was a "hole" in the daily routine of ASCII chars.
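
The exact code is not shown in this post, so the following is only a sketch of the general idea: put the byte values into memory with PokeA and write the buffer out unchanged. It uses AllocateMemory rather than Ascii() to keep the example short, and the file path is a placeholder.

```purebasic
; Sketch, not the original poster's code: build the bytes in memory with PokeA
; and write the buffer unchanged with WriteData.
*buffer = AllocateMemory(256)
If *buffer
  For i = 0 To 255
    PokeA(*buffer + i, i)        ; store the exact byte value, no conversion
  Next
  If CreateFile(0, GetTemporaryDirectory() + "bytes_0_255.bin")   ; placeholder path
    WriteData(0, *buffer, 256)   ; raw copy of the buffer to disk
    CloseFile(0)
  EndIf
  FreeMemory(*buffer)
EndIf
```

Because WriteData copies memory as-is, no string conversion is involved and the $80-$9F range is written like any other byte.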