Http & unicode: Sending strings

Dare2
Moderator
Posts: 3321
Joined: Sat Dec 27, 2003 3:55 am
Location: Great Southern Land

Post by Dare2 »

Currently, in order to get a server to recognise a POSTed string (set up via HttpSendRequest) I am doing this:

Code:

  PokeS(dta,PeekS(dta,dLen),dLen,#PB_Ascii)
  r = HttpSendRequest_(hOpen,0,0,dta,dLen)

However, info sent via HttpAddRequestHeaders, like this:

Code:

  wrk = "Content-Type: application/x-www-form-urlencoded"+Chr(13)+Chr(10)
  r = HttpAddRequestHeaders_(hOpen, @wrk,Len(wrk), #HTTP_ADDREQ_FLAG_ADD|#HTTP_ADDREQ_FLAG_REPLACE)

works as is.

On retrieving data received from the server into a buffer I use

Code:

  PeekS(buffer,finalTotalCharsAsReportedByOS,#PB_Ascii)

These kludges appear to be working; however, I would like to feel sure that this is OK in all circumstances (with strings) and not a fluke.

Also, is there some rule here with strings and HTTP, as to when servers need ASCII and when the unicode string is OK? Does this differ from server to server (e.g. Apache -v- IIS)?


Thanks!
@}--`--,-- A rose by any other name ..
lexvictory
Addict
Posts: 1027
Joined: Sun May 15, 2005 5:15 am
Location: Australia

Post by lexvictory »

Use UTF-8 - that's what is supposed to be used. :lol:
Demonio Ardente

Currently managing Linux & OS X Tailbite
OS X TailBite now up to date with Windows!
Edwin Knoppert
Addict
Posts: 1073
Joined: Fri Apr 25, 2003 11:13 pm
Location: Netherlands

Post by Edwin Knoppert »

Usually HTTP is not in Unicode; I wonder if there is a Unicode transfer at all.
Maybe browsers convert two-byte UTF-8 or ASCII (7-bit) symbols to Unicode in China and so on?
IMO web servers never do any translation.

Maybe someone can shed some light on this :)

Post by Dare2 »

Hi lexvictory,

Thanks for the tip. Can you explain to me how to take the strings I have swimming around in a unicode program and make them UTF8-ed for the API calls? I'm lost with this.

I have workarounds but I fear them as I don't understand what is happening and so am not sure I have "fixed" the issue or just fixed the instances in my testing.

Thanks.

Edit: Hi Edwin, you may be right. In fact, the same prog compiled non-unicode has no issues. But compiled unicode it falls over even when the strings are forced ASCII for out (to server) and I get embedded garbage in the strings coming in (from server). So I am screwing up somewhere. Blowed if I know where, though.

Post by lexvictory »

UTF-8 is very complicated. In ASCII mode you have no problems, because ASCII characters map directly to UTF-8.

for a better explanation, see here: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
or here: http://en.wikipedia.org/wiki/UTF-8

As for how to change the format, use a combination of PeekS() and PokeS().
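
A minimal sketch of that PeekS()/PokeS() round trip, assuming a PureBasic version that provides StringByteLength() and the #PB_UTF8 flag. The variable names are only illustrative and are not taken from anyone's actual program:

```purebasic
; Sketch: copy a native PureBasic string into a byte buffer as UTF-8
; and read it back again. Names (text, *utf8, byteLen) are illustrative.
text.s = "P1=One&P2=Two"

; Encoded size in bytes, not counting the terminating null
byteLen = StringByteLength(text, #PB_UTF8)
*utf8 = AllocateMemory(byteLen + 1)      ; +1 for the null that PokeS() writes

If *utf8
  PokeS(*utf8, text, -1, #PB_UTF8)       ; -1 = write the whole string
  ; *utf8 and byteLen can now be handed to a call that expects raw bytes

  back.s = PeekS(*utf8, -1, #PB_UTF8)    ; decode back into a native string
  Debug back
  FreeMemory(*utf8)
EndIf
```

The point of the separate buffer is that the byte count (byteLen) and the character count (Len(text)) stop being the same number as soon as the string contains anything outside plain ASCII.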

As I'm not quite sure which functions you are using, I don't know if I've explained it enough. :lol:

@Edwin Knoppert: HTTP IS supposed to be sent in UTF-8 (at least HTTP/1.1 is) - just try to use the webserver example in unicode mode. It's very hard to do, and even I haven't done it fully.

Post by Edwin Knoppert »

Even if you succeed in sending Unicode from the web server to the client, you still need to do it according to the standards.
I assume there is an ordinary ASCII header which says the contents are in Unicode.

I have never done that or seen it; I'm just stressing that people should follow standards.
I just cannot imagine that ALL (header) data sent from the server is in Unicode.
Then it would be odd that I can see plain code on some websites while the rest is blocks.

Post by lexvictory »

Edwin Knoppert wrote:
> Even if you succeed in sending Unicode from the web server to the client, you still need to do it according to the standards.
> I assume there is an ordinary ASCII header which says the contents are in Unicode.
>
> I have never done that or seen it; I'm just stressing that people should follow standards.
> I just cannot imagine that ALL (header) data sent from the server is in Unicode.
> Then it would be odd that I can see plain code on some websites while the rest is blocks.

Header data is sent in ASCII/UTF-8. There is usually NO Unicode data in the headers, and since UTF-8 is backwards compatible with ASCII, you could say the header info is in either.

And usually the header doesn't say whether the data is Unicode or not; most often the encoding is put in the HTML code. I think dynamic pages sometimes put the encoding in the Content-Type header.
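
When the encoding is given in the Content-Type header, it appears as a charset parameter, for example `text/html; charset=utf-8`. A rough sketch of picking it out of a header value follows; CharsetFromContentType() is a made-up helper name, not a PureBasic function, and the parsing is deliberately simple:

```purebasic
; Sketch: extract the charset parameter from a Content-Type header value.
; CharsetFromContentType() is a hypothetical helper for illustration only.
Procedure.s CharsetFromContentType(value.s)
  Protected pos, rest.s
  pos = FindString(LCase(value), "charset=", 1)
  If pos = 0
    ProcedureReturn ""                       ; no charset parameter present
  EndIf
  rest = Mid(value, pos + 8, Len(value))     ; text after "charset="
  rest = StringField(rest, 1, ";")           ; drop any further parameters
  rest = RemoveString(rest, Chr(34))         ; drop optional double quotes
  ProcedureReturn LCase(Trim(rest))
EndProcedure

Debug CharsetFromContentType("text/html; charset=UTF-8")
Debug CharsetFromContentType("application/x-www-form-urlencoded")
```

An empty result means the header gave no hint, in which case the page itself (or a sensible default) has to decide how the body bytes are decoded.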

Post by Dare2 »

Hi lexvictory, stripped code below (all checks, progress calls, etc. pulled out, leaving, hopefully, only the significant bits and not too messy despite the missing bits). Treat as pseudocode. :)

Code:

; ---------------------  these bits handled outside subroutine and passed in
;                          either as globals or as parameters.

  netAgent.s = "My Client's 'browser' Name"
  server.s = "127.0.0.1"
  item.s = "/MyVirtualDomain/wwwRoot/someFolder/somePage.asp"
  dta.s = "P1=One&P2=Two"
  dLen = Len(dta)
  *bfr = AllocateMemory(1)
  hdrSz = 1024
  bufSz = 1024

; --------------------- This is the subroutine

; --------------------- Following works as-is, strings accepted as-is

  hInet = InternetOpen_(netAgent, #INTERNET_OPEN_TYPE_DIRECT, #Null,#Null,0)
  hConn = InternetConnect_(hInet,server, #INTERNET_DEFAULT_HTTP_PORT, #Null,#Null, #INTERNET_SERVICE_HTTP,0,0)
  hOpen = HttpOpenRequest_(hConn,"POST",item, #Null,#Null,0, #INTERNET_FLAG_RELOAD,0)
  wrk.s = "Content-Type: application/x-www-form-urlencoded"+Chr(13)+Chr(10) 
  hAdd = HttpAddRequestHeaders_(hOpen,@wrk,Len(wrk),#HTTP_ADDREQ_FLAG_ADD|#HTTP_ADDREQ_FLAG_REPLACE)

; --------------------- Need to kludge post data

  wrk = Space(dLen)
  PokeS(@wrk,dta,dLen,#PB_Ascii)           ; dta is already a string, so no PeekS() is needed
  hSend = HttpSendRequest_(hOpen,0,0,@wrk,dLen)
  *bfr = ReAllocateMemory(*bfr,hdrSz)      ; keep the returned pointer, the block may have moved
  i = HttpQueryInfo_(hOpen,#HTTP_QUERY_RAW_HEADERS_CRLF,*bfr,@hdrSz,0)

  If i

; --------------------- Came back as unicode so ASCII it at start of buffer

    wrk = PeekS(*bfr,hdrSz/SizeOf(Character))   ; hdrSz comes back in bytes, PeekS() wants characters
    PokeS(*bfr,wrk,Len(wrk),#PB_Ascii)
    bufLen = Len(wrk)
    pgSz = Val(httpGetHeader(wrk,"Content-Length:"))
; --------------------- httpGetHeader above is just a procedure that grabs the value

  Else
    pgSz = 0
    bufLen = 0
  EndIf

  rcvd = 0
  Repeat
    *bfr = ReAllocateMemory(*bfr,bufLen+(bufSz))
    InternetReadFile_(hOpen,*bfr + bufLen,bufSz,@rcvd)
    bufLen + rcvd
  Until rcvd = 0
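  InternetCloseHandle_(hOpen)   ; close the request and connection handles too
  InternetCloseHandle_(hConn)
  InternetCloseHandle_(hInet)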

; ---------------------
; Note: The contents of buffer are:
;    the header (unicode PokeS(PeekS())-ed to ascii)
;    the page - arrived as ascii

; ---------------------
; Now get it as ascii into a usable unicode format

  wrk=PeekS(*bfr,bufLen,#PB_Ascii)   ; bufLen bytes are in the buffer; i only holds the HttpQueryInfo_ result

So:

Some stuff (like content type, server name and page (item) path) sent as unicode. Post data sent as Ascii. Any other mix fails both on my localhost and with online servers.

Header comes back as unicode (but actually, may have been UTF8 - only parts were garbled). Actual page arrived as plain old Ascii.

I can try using #PB_UTF8 flag with the PeekS/PokeS stuff, but that still has the fiddling around. And as you say, ASCII will work. Also, as you pointed out, it seems that headers are unicode, whilst data (in) and page (back) are not. Haven't tried with a plain old GET yet.
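
One way to read the behaviour described above, offered as an assumption rather than a verified fact about PureBasic: in a unicode build the Internet*_ and Http*_ calls resolve to the wide (W) WinINet functions, whose string parameters are UTF-16, while the optional data handed to HttpSendRequest is a raw byte buffer with a byte count and goes out exactly as it sits in memory. InternetReadFile likewise returns the body bytes untouched. A hedged sketch of sending the POST data on that basis (Windows only; SendPostData() is a hypothetical helper, and hOpen is a request handle from HttpOpenRequest_() as in the code above):

```purebasic
; Sketch: send POST data as an explicit UTF-8 byte buffer.
; SendPostData() is a hypothetical helper; hOpen must be a valid request handle.
Procedure.i SendPostData(hOpen.i, postData.s)
  Protected byteLen, *body, result

  byteLen = StringByteLength(postData, #PB_UTF8)   ; size in bytes, without the null
  *body = AllocateMemory(byteLen + 1)              ; +1 for the null PokeS() writes
  If *body
    PokeS(*body, postData, -1, #PB_UTF8)
    ; The last two arguments are a raw buffer and its length in BYTES
    result = HttpSendRequest_(hOpen, 0, 0, *body, byteLen)
    FreeMemory(*body)
  EndIf

  ProcedureReturn result
EndProcedure
```

For plain ASCII data such as "P1=One&P2=Two" the UTF-8 bytes are identical to what #PB_Ascii produces, so this should behave the same as the workaround above while making the byte count explicit.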

If what I am doing is safe, then I am happy enough with it. I just want to be sure it is safe and not just a happenstance that it is working. :)

If you can help, much appreciated. If you can't, I still appreciate the effort and interest you showed. Thanks! :)

Post by lexvictory »

OK, will look at the code tomorrow.

As I understand it, all header/request data should be in UTF-8, while the page sent can be in any encoding.

ie:
"GET / HTTP/1.1<crlf>
Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and escaped - ' ' changed to '%20')

"HTTP/1.1 200 OK<crlf>
Content-type: <content type><crlf>
<other headers><crlf><crlf>" <-- should be in UTF-8
"<page content>" <-- can be in ANY encoding.

Although I'm not sure if you need to change anything into UTF-8 when using the Internet*_() functions; will look at this too.

BTW: Unicode is F*****G complicated, AND confusing. :lol: :shock: :?
Thalius
Enthusiast
Posts: 711
Joined: Thu Jul 17, 2003 4:15 pm

Post by Thalius »

Hihi Dare2 =)
All you need to worry about so far is the URL Encoding.

To finish the URL-encoding of the data, perform the following two steps:

1. Change all " " (spaces) in the string to + (plus signs)
2. If you have any funky characters (defined below) in the names or values, replace them with %xx, where xx is the ASCII code of that funky character, in HEX. Do NOT replace the & and = signs that separate the name-value pairs.
Funky characters are &, =, +, %, and ?. Replace them with %26, %3D, %2B, %25, and %3F respectively. If any other characters might give you trouble, escape them in the same way-- there is no danger from escaping too many characters.
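
The two steps above can be sketched as a small procedure that encodes a single name or value. FormEncode() is a hypothetical helper name, and treating non-ASCII characters as UTF-8 bytes is an assumption made here, not something stated in this thread. It escapes more characters than the short list above, which is harmless as noted:

```purebasic
; Sketch: URL-encode ONE name or value for application/x-www-form-urlencoded.
; FormEncode() is a hypothetical helper. Do not pass the whole "a=b&c=d"
; string, or the separators would be escaped as well.
Procedure.s FormEncode(value.s)
  Protected result.s, i, code, byteLen, *buf

  byteLen = StringByteLength(value, #PB_UTF8)
  *buf = AllocateMemory(byteLen + 1)             ; +1 for the terminating null
  If *buf = 0
    ProcedureReturn ""
  EndIf
  PokeS(*buf, value, -1, #PB_UTF8)               ; work on bytes, not characters

  For i = 0 To byteLen - 1
    code = PeekB(*buf + i) & $FF                 ; unsigned byte value
    Select code
      Case 'A' To 'Z', 'a' To 'z', '0' To '9', '-', '_', '.'
        result + Chr(code)                       ; safe, keep as-is
      Case ' '
        result + "+"                             ; step 1: space becomes plus
      Default
        result + "%" + RSet(Hex(code), 2, "0")   ; step 2: %xx escape
    EndSelect
  Next

  FreeMemory(*buf)
  ProcedureReturn result
EndProcedure

; Build the body from separately encoded parts, keeping & and = literal
body.s = "P1=" + FormEncode("One & Two") + "&P2=" + FormEncode("x=y")
Debug body
```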

Another caution:
The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers ought to be cautious about depending on URI lengths
above 255 bytes, because some older client or proxy
implementations might not properly support these lengths.


RFC *snip*
RFC 1738 Uniform Resource Locators (URL) December 1994


alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">


reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex

unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit

Greets, Thalius
"In 3D there is never enough Time to do Things right,
but there's always enough Time to make them *look* right."
"psssst! i steal signatures... don't tell anyone! ;)"
Dare
Addict
Posts: 1965
Joined: Mon May 29, 2006 1:01 am
Location: Outback

Post by Dare »

Hi lexvictory,

Thanks, but don't go overboard and waste your time with this, please. It will give me a guilt trip. :)

And yes, extremely complicated and confusing is right. Thank goodness Pure Four does most of the work! :D


Hi Thalius,

Thanks for the tip. Actually I have an encoding subroutine which does the % escape thing, but I have only ever used it with query strings - sometimes. With IIS (or perhaps it is the Internet*_ stuff?) you can get away without using it, it seems?

The issue I was having is in the headers, both ways (if post data counts as a header).

(BTW, you need to de-flea your pussy.) :)
Edit: Too late, the pure creature has gone. :) (Thank goodness) :D


So, as an aside from the peekS/pokeS jiggling, you guys are saying also escape everything in the headers going to the server?

Thanks again!
Dare2 cut down to size

Post by lexvictory »

> Thanks, but don't go overboard and waste your time with this, please. It will give me a guilt trip.
Don't worry, I want to do something like this as well.

> So, as an aside from the peekS/pokeS jiggling, you guys are saying also escape everything in the headers going to the server?
No, just the URL (including the query string) and the POST data.

Note that in query strings there are two ways of escaping spaces, '+' and '%20' - try it with Google. :D

Post by lexvictory »

lexvictory wrote:
> ie:
> "GET / HTTP/1.1<crlf>
> Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and escaped - ' ' changed to '%20')

should be

"GET / HTTP/1.1<crlf>
Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and URL escaped - ' ' changed to '%20', etc)

Post by Thalius »

As far as I recall, the request headers are standard UTF-8 (non-unicode). So should the content be (well OK, content can be anything; you can also add an encoding type param in the header).

Need more code to fiddle...
Am atm writing a network lib for games (TCP client/server, not using API calls, only PB) - tough bugger...

Mmhm, I realise this is huge and RFC descriptions suck, but this might be informative. =)

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

Cheers,

Thalius

Post by lexvictory »

Hey, just do it without escaping... IE does sometimes. :lol: :lol:
Only joking... :D