Http & unicode: Sending strings

Posted: Thu May 25, 2006 5:56 am
by Dare2
Currently, in order to get a server to recognise a POSTed string (set up via HttpSendRequest) I am doing this:

Code: Select all

  PokeS(dta,PeekS(dta,dLen),dLen,#PB_Ascii)
  r = HttpSendRequest_(hOpen,0,0,dta,dLen)
However info sent via HttpAddRequestHeaders, like this:

Code: Select all

  wrk = "Content-Type: application/x-www-form-urlencoded"+Chr(13)+Chr(10) 
  r = HttpAddRequestHeaders_(hOpen, @wrk,Len(wrk), #HTTP_ADDREQ_FLAG_ADD|#HTTP_ADDREQ_FLAG_REPLACE)
works as is.

On retrieving data received from the server into a buffer I use

Code: Select all

  PeekS(buffer,finalTotalCharsAsReportedByOS,#PB_Ascii)
These kludges appear to be working. However, I would like to be sure that this is OK in all circumstances (with strings) and not a fluke.

Also, is there some rule for strings and HTTP as to when servers need ASCII and when a unicode string is OK? Does this differ from server to server (e.g. Apache vs. IIS)?


Thanks!
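
A small sketch of why the character count and the byte count diverge in a unicode build. It assumes, as is usual on Windows, that a unicode build stores strings as UTF-16 (two bytes per character for this text). Python is used purely for illustration, and the variable names are not from the original code.

```python
# Illustrative sketch only: the same POST body in two encodings.
body = "P1=One&P2=Two"

ascii_bytes = body.encode("ascii")      # one byte per character
utf16_bytes = body.encode("utf-16-le")  # two bytes per character (assumed unicode-build layout)

print(len(body), len(ascii_bytes), len(utf16_bytes))

# Passing the character count as the byte count would send only the first
# half of the UTF-16 buffer, and every second byte in it is a zero byte.
truncated = utf16_bytes[:len(body)]
print(truncated)
```

If that assumption holds, it would explain why converting the POST data to ASCII first makes the server accept it: the length passed to the send call is then a true byte count.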

Posted: Sat May 27, 2006 10:26 am
by lexvictory
Use UTF-8, that's what is supposed to be used... :lol:

Posted: Sat May 27, 2006 10:56 am
by Edwin Knoppert
Usually HTTP is not in unicode; I wonder if there is a unicode transfer at all.
Maybe browsers in China and elsewhere convert two-byte UTF-8 or ASCII (7-bit) symbols to unicode?
IMO webservers never do any translation.

Maybe someone can shed some light on this :)

Posted: Sat May 27, 2006 1:58 pm
by Dare2
Hi lexvictory,

Thanks for the tip. Can you explain to me how to take the strings I have swimming around in a unicode program and make them UTF8-ed for the API calls? I'm lost with this.

I have workarounds but I fear them as I don't understand what is happening and so am not sure I have "fixed" the issue or just fixed the instances in my testing.

Thanks.

Edit: Hi Edwin, you may be right. In fact, the same prog compiled non-unicode has no issues. But compiled unicode it falls over even when the strings are forced ASCII for out (to server) and I get embedded garbage in the strings coming in (from server). So I am screwing up somewhere. Blowed if I know where, though.

Posted: Sat May 27, 2006 2:27 pm
by lexvictory
UTF-8 is very complicated. In ASCII mode you have no problems, because ASCII characters map directly to UTF-8.

for a better explanation, see here: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
or here: http://en.wikipedia.org/wiki/UTF-8

As for how to change the format, use a combination of PeekS() and PokeS().

As I'm not quite sure which functions you are using, I don't know if I've explained it enough... :lol:

@Edwin Knoppert: HTTP IS supposed to be sent in UTF-8 (at least HTTP/1.1 is). Just try to use the webserver example in unicode mode... it's very hard to do, and even I haven't done it fully...
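
To make the "ASCII maps directly to UTF-8" point concrete, here is a minimal sketch in Python (used only for illustration). The header string is the one from the first post; the accented sample word is a made-up example.

```python
# A pure-ASCII string produces identical bytes in ASCII and UTF-8.
header = "Content-Type: application/x-www-form-urlencoded\r\n"
same_bytes = header.encode("ascii") == header.encode("utf-8")
print(same_bytes)

# A non-ASCII character (hypothetical sample text) needs more than one byte in UTF-8.
sample = "caf\u00e9"
utf8_bytes = sample.encode("utf-8")
print(len(sample), len(utf8_bytes))
```

So as long as everything sent is plain ASCII, converting to ASCII or to UTF-8 gives the same bytes on the wire. The two only differ once non-ASCII characters appear.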

Posted: Sat May 27, 2006 3:15 pm
by Edwin Knoppert
Even if you succeed in producing unicode from the webserver to the client, you still need to do it according to standards.
I assume there is an ordinary ASCII header which tells the client that the contents are in unicode.

I have never done that or seen it, I'm just stressing that people should follow standards.
I just cannot imagine that ALL (header) data sent from the server is in unicode.
Then it would be odd that I can see plain code on some websites while the rest is blocks.

Posted: Sat May 27, 2006 3:22 pm
by lexvictory
Edwin Knoppert wrote:
Even if you succeed to produce the unicode from the webserver to client you still need to do it according to standards.
I assume there is an ordinary ASCII header which tells the contents are in unicode.

I have never done that or seen it, I'm just stressing people should follow standards.
I just cannot imagine ALL (header) data sent from the server is in unicode.
Then it would be odd I can see plain code on some websites while the rest is blocks.

Header data is sent in ASCII/UTF-8. There is usually NO unicode data in the headers, and since UTF-8 is backwards compatible with ASCII, you could say the header info is in either...

And usually the header doesn't say whether the data is unicode or not; most often that is put in the HTML code... I think sometimes dynamic pages put the encoding in the Content-Type header...
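
When the encoding is given in the Content-Type header, it appears as a `charset` parameter. A minimal sketch of pulling it out, using Python's standard `email.message.Message` parser purely for illustration; the helper name `charset_from_content_type` is hypothetical.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset, the caller has to fall back on something else, such as a declaration inside the HTML itself or a default of its own choosing.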

Posted: Sat May 27, 2006 3:49 pm
by Dare2
Hi lexvictory, stripped code below (all checks, progress calls, etc. pulled out, leaving, hopefully, only the significant bits and not too messy despite the missing bits). Treat as pseudo code. :)

Code: Select all

; ---------------------  these bits handled outside subroutine and passed in
;                          either as globals or as parameters.

  netAgent.s = "My Client's 'browser' Name"
  server.s = "127.0.0.1"
  item.s = "/MyVirtualDomain/wwwRoot/someFolder/somePage.asp"
  dta.s = "P1=One&P2=Two"
  dLen = Len(dta)
  *bfr = AllocateMemory(1)
  hdrSz = 1024
  bufSz = 1024

; --------------------- This is the subroutine

; --------------------- Following works as-is, strings accepted as-is

  hInet = InternetOpen_(netAgent, #INTERNET_OPEN_TYPE_DIRECT, #Null,#Null,0)
  hConn = InternetConnect_(hInet,server, #INTERNET_DEFAULT_HTTP_PORT, #Null,#Null, #INTERNET_SERVICE_HTTP,0,0)
  hOpen = HttpOpenRequest_(hConn,"POST",item, #Null,#Null,0, #INTERNET_FLAG_RELOAD,0)
  wrk.s = "Content-Type: application/x-www-form-urlencoded"+Chr(13)+Chr(10) 
  hAdd = HttpAddRequestHeaders_(hOpen,@wrk,Len(wrk),#HTTP_ADDREQ_FLAG_ADD|#HTTP_ADDREQ_FLAG_REPLACE)

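; --------------------- Need to kludge post data: convert it to ASCII bytes

  wrk = Space(dLen)
  PokeS(@wrk,dta,dLen,#PB_Ascii)        ; dta is already a string, so no PeekS is needed
  hSend = HttpSendRequest_(hOpen,0,0,@wrk,dLen)
  *bfr = ReAllocateMemory(*bfr,hdrSz)   ; keep the (possibly moved) buffer address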
  i = HttpQueryInfo_(hOpen,#HTTP_QUERY_RAW_HEADERS_CRLF,*bfr,@hdrSz,0)

  If i

; --------------------- Came back as unicode so ASCII it at start of buffer

    wrk = PeekS(*bfr,hdrSz/SizeOf(character))  ; hdrSz comes back in bytes, PeekS wants characters
    PokeS(*bfr,wrk,Len(wrk),#PB_Ascii)
    bufLen = Len(wrk)
    pgSz = Val(httpGetHeader(wrk,"Content-Length:"))
; --------------------- httpGetHeader above is just a procedure that grabs the value

  Else
    pgSz = 0
    bufLen = 0
  EndIf

  rcvd = 0
  Repeat
    *bfr = ReAllocateMemory(*bfr,bufLen+(bufSz))
    InternetReadFile_(hOpen,*bfr + bufLen,bufSz,@rcvd)
    bufLen + rcvd
  Until rcvd = 0
  InternetCloseHandle_(hInet)

; --------------------- 
; Note: The contents of buffer are:
;    the header (unicode, PokeS(PeekS())-ed to ascii)
;    the page - arrived as ascii

; --------------------- 
; Now get it as ascii into a usable unicode format

  wrk=PeekS(*bfr,bufLen,#PB_Ascii)  ; bufLen = total bytes in buffer (i only held the HttpQueryInfo_ result)
So:

Some stuff (like content type, server name and page (item) path) sent as unicode. Post data sent as Ascii. Any other mix fails both on my localhost and with online servers.

Header comes back as unicode (but actually, may have been UTF8 - only parts were garbled). Actual page arrived as plain old Ascii.

I can try using #PB_UTF8 flag with the PeekS/PokeS stuff, but that still has the fiddling around. And as you say, ASCII will work. Also, as you pointed out, it seems that headers are unicode, whilst data (in) and page (back) are not. Haven't tried with a plain old GET yet.

If what I am doing is safe, then I am happy enough with it. I just want to be sure it is safe and not just a happenstance that it is working. :)

If you can help, much appreciated. If you can't, I still appreciate the effort and interest you showed. Thanks! :)
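
A hedged sketch of the receiving side described above, in Python for illustration only. It assumes the header block comes back from the wide-character API as UTF-16 while the page body arrives as single-byte text, which matches what the post observes but is not confirmed here. The sample data and the helper `decode_response` are made up.

```python
# Assumption: header block is UTF-16LE, page body is raw single-byte text.
header_raw = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n".encode("utf-16-le")
body_raw = b"hello"


def decode_response(header_bytes, body_bytes, body_encoding="ascii"):
    """Decode each part with its own encoding instead of mixing them in one buffer."""
    header_text = header_bytes.decode("utf-16-le")
    body_text = body_bytes.decode(body_encoding)
    return header_text, body_text


header_text, body_text = decode_response(header_raw, body_raw)
print(repr(header_text), repr(body_text))

# Reading single-byte data as if it were UTF-16 pairs the bytes up into
# unrelated characters, one way "embedded garbage" can appear.
garbled = b"hello!".decode("utf-16-le")
print(len(garbled))
```

Keeping the header and the body in separate buffers, each decoded with its own encoding, avoids having to convert one part in place before appending the other.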

Posted: Mon May 29, 2006 1:31 pm
by lexvictory
OK, will look at the code tomorrow.

as i understand it, all header/request data should be in UTF-8, while the page sent can be in any encoding.

ie:
"GET / HTTP/1.1<crlf>
Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and escaped - ' ' changed to '%20')

"HTTP/1.1 200 OK<crlf>
Content-type: <content type><crlf>
<other headers><crlf><crlf>" <-- should be in UTF-8
"<page content>" <-- can be in ANY encoding.

although i'm not sure if you need to change anything into utf8 when using the Internet*_() functions, will look at this too.
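
The layout above can be sketched as raw bytes. This is a Python illustration only, assuming the header lines contain nothing but ASCII characters (in which case ASCII and UTF-8 give the same bytes). The function name `build_post_request` is hypothetical and this is not what the Internet*_() functions do internally.

```python
def build_post_request(host, path, body_text, body_encoding="utf-8"):
    """Assemble a minimal HTTP/1.1 POST request as bytes."""
    body = body_text.encode(body_encoding)
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # length in bytes, not characters
        "\r\n"
    ).encode("ascii")
    return head + body


request = build_post_request("somehost.com", "/", "P1=One&P2=Two")
print(request)
```

The point to notice is that Content-Length counts the encoded bytes of the body, so it has to be computed after the encoding step.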

btw: unicode is F*****G complicated, AND confusing.... :lol: :shock: :?

Posted: Mon May 29, 2006 2:00 pm
by Thalius
Hihi Dare2 =)
All you need to worry about so far is the URL Encoding.

To finish the URL-encoding of the data, perform the following two steps:

1. Change all " " (spaces) in the string to + (plus signs)
2. If you have any funky characters (defined below) in the names or values, replace them with %xx, where xx is the ASCII code of that funky character, in HEX. Do NOT replace the & and = signs that separate the name-value pairs.
Funky characters are &, =, +, %, and ?. Replace them with %26, %3D, %2B, %25, and %3F respectively. If any other characters might give you trouble, escape them in the same way-- there is no danger from escaping too many characters.
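
The two steps above can be sketched with Python's standard `urllib.parse.urlencode`, shown here only as an illustration. The field names and values are made-up sample data.

```python
from urllib.parse import urlencode

# Hypothetical form fields containing spaces and "funky" characters.
fields = {"P1": "One & Two", "P2": "a+b=c?"}

# urlencode turns spaces into "+" and escapes &, =, + and ? inside the
# values, while leaving the separating & and = untouched.
encoded = urlencode(fields)
print(encoded)  # P1=One+%26+Two&P2=a%2Bb%3Dc%3F
```

The escaping is applied to each name and value separately, before they are joined, which is why the separators survive.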

Another caution:
The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers ought to be cautious about depending on URI lengths
above 255 bytes, because some older client or proxy
implementations might not properly support these lengths.


RFC *snip*
RFC 1738 Uniform Resource Locators (URL) December 1994


alpha          = lowalpha | hialpha
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">


reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape
digits         = 1*digit
Greets, Thalius
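
As a rough illustration of the grammar quoted above, here is a hand-rolled escaper in Python that leaves the `unreserved` characters alone and writes everything else as `%` plus two hex digits. Encoding the text to UTF-8 before escaping is an assumption of this sketch, not something RFC 1738 specifies, and the name `escape_rfc1738` is hypothetical.

```python
import string

SAFE = "$-_.+"
EXTRA = "!*'(),"
# unreserved = alpha | digit | safe | extra
UNRESERVED = set(string.ascii_letters + string.digits + SAFE + EXTRA)


def escape_rfc1738(text, encoding="utf-8"):
    """Escape every byte that is not an unreserved character as %XX."""
    out = []
    for byte in text.encode(encoding):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


print(escape_rfc1738("a b&c"))
```

Note that this grammar treats "+" as safe, whereas in form data "+" stands for a space, so a literal plus sign in a form value still needs escaping as %2B.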

Posted: Mon May 29, 2006 2:34 pm
by Dare
Hi lexvictory,

Thanks, but don't go overboard and waste your time with this, please. It will give me a guilt trip. :)

And yes, extremely complicated and confusing is right. Thank goodness Pure Four does most of the work! :D


Hi Thalius,

Thanks for the tip. Actually I have an encoding subroutine which does the % escape thing, but I have only ever used it with query strings, and only sometimes. With IIS (or perhaps it is the Internet*_ stuff?) it seems you can get away without using it?

The issue I was having is in the headers, both ways (if post data is a header)

(BTW, you need to de-flea your pussy.) :)
Edit: Too late, the pure creature has gone. :) (Thank goodness) :D


So, as an aside from the peekS/pokeS jiggling, you guys are saying also escape everything in the headers going to the server?

Thanks again!

Posted: Mon May 29, 2006 2:45 pm
by lexvictory
> Thanks, but don't go overboard and waste your time with this, please. It will give me a guilt trip.
Don't worry, I want to do something like this too...

>So, as an aside from the peekS/pokeS jiggling, you guys are saying also escape everything in the headers going to the server?
No, just the url (including query string) and the postdata

note that in query strings, there are 2 methods of escaping spaces, '+' and '%20' - try it with google :D
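
A quick sketch of the two space-escaping styles, using Python's standard `urllib.parse` functions for illustration. The sample query text is made up.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "pure basic"  # hypothetical search text

percent_style = quote(query)    # space becomes %20
plus_style = quote_plus(query)  # space becomes +
print(percent_style, plus_style)

# A form-style decoder accepts both spellings and yields the same text.
decoded_a = unquote_plus(percent_style)
decoded_b = unquote_plus(plus_style)
print(decoded_a == decoded_b)
```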

Posted: Mon May 29, 2006 2:47 pm
by lexvictory
lexvictory wrote:
ie:
"GET / HTTP/1.1<crlf>
Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and escaped - ' ' changed to '%20')
should be
"GET / HTTP/1.1<crlf>
Host: somehost.com<crlf><crlf>" <-- should be in UTF8 (and URL escaped - ' ' changed to '%20', etc)

Posted: Mon May 29, 2006 2:53 pm
by Thalius
As far as I recall, the request headers are standard UTF-8 (non-unicode), and so should the content be (well, OK, content can be anything; you can also add an encoding type param in the header).

Need more code to fiddle...
Am atm writing a network lib for games (TCP client/server, not using API calls, only PB) - tough bugger...

Mmhm, I realise this is huge and RFC descriptions suck.. but this might be informative.. =)

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

Cheers,

Thalius

Posted: Mon May 29, 2006 3:00 pm
by lexvictory
Hey, just do it without escaping... IE does sometimes... :lol: :lol:
Only joking... :D