ReceiveHTTPFile() vs WebGadget()

BarryG

Here's a URL whose HTML code I'm trying to download:

https://www.realestate.com.au/buy/in-ro ... 226/list-1

When I use ReceiveHTTPFile() with it, I get these characters in the downloaded file:

[Image not preserved]

I was just doing this:

Code: Select all

ReceiveHTTPFile("https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1","D:\ReceiveHTTPFile.txt")
But when I use a WebGadget() with #PB_Web_HtmlCode, the text I get back looks like this:

[Image not preserved]

Why the difference? It seems to me that I can't rely on ReceiveHTTPFile() to get the HTML of web pages?

Here's the CopyURLText() code that I'm using:

Code: Select all

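; Windows only: uses the Win32 API and the IWebBrowser2 COM interface.
Procedure WaitForWebGadget(gad)
  ; Fetch the browser interface that the WebGadget keeps in its window user data.
  Browser.IWebBrowser2=GetWindowLongPtr_(GadgetID(gad),#GWLP_USERDATA)
  Repeat
    ; Pump pending window events so the browser control can keep loading.
    While WindowEvent() : Wend
    Browser\get_Busy(@busy.l)
    If busy=#VARIANT_TRUE : Sleep_(1) : EndIf
  Until busy=#VARIANT_FALSE
EndProcedure

; Load the URL in a hidden, zero-size window, wait for it, then read back the HTML.
Procedure.s CopyURLText(url$)
  w=OpenWindow(#PB_Any,0,0,0,0,"",#PB_Window_Invisible)
  If w
    g=WebGadget(#PB_Any,0,0,0,0,url$)
    WaitForWebGadget(g)
    text$=GetGadgetItemText(g,#PB_Web_HtmlCode)
    CloseWindow(w)
  EndIf
  ProcedureReturn text$
EndProcedure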

Debug CopyURLText("https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1")
juergenkulow

The web page returns different garbage on each ReceiveHTTPFile() call:

Code: Select all

00000000  1F 8B 08 00 00 00 00 00 00 03 5D 51 DB 52 A3 40  .‹........]QÛR£@
00000010  10 FD 95 E8 83 9A 07 C2 70 9B C0 0A 58 91 11 B3  .ý•胚.Âp›À.X‘.³
00000020  10 43 36 12 02 79 B1 06 66 48 48 C2 C5 01 C4 B8  .C6..y±.fHHÂÅ.ĸ
00000030  B5 FF BE AB A6 D6 94 2F DD 7D FA 54 F7 39 D5 AD  µÿ¾«¦Ö”/Ý}úT÷9Õ­
00000040  9F 21 CF F2 A3 D9 5D 6F D3 E4 7B 53 3F 46 8A 89  Ÿ!Ïò£Ù]oÓä{S?FЉ
00000050  A9 F3 9F 29 2E C9 C1 D4 EB 84 65 55 63 76 59 41  ©óŸ).ÉÁÔë„eUcvYA
00000060  CA 6E E0 CE 1E 91 6B FC FE 73 FD 51 0C 8A B2 33  ÊnàÎ.‘küþsýQ.в3
00000070  9A 43 45 CB B4 57 51 96 96 2C C7 45 42 CF 0C E3  šCEË´WQ––,ÇEBÏ.ã
00000080  B2 2D 08 4D B3 82 92 CB 8B 8B 13 EA 7D E2 E6 1B  ²-.M³‚’Ë‹‹.ê}âæ.
00000090  1E C4 FF 76 5F 9D 34 FB 3F 10 6E 4E A8 77 D4 3F  .Äÿv_4û?.nN¨wÔ?
000000A0  2A D6 0D 66 8D F1 5F FD AA 7F AD F3 47 8B 47 AB  *Ö.fñ_ýª­óG‹G«
000000B0  BD 9A 25 C6 39 2F C8 1A D5 14 41 E2 80 90 62 4E  ½š%Æ9/È.Õ.A‐bN
000000C0  4E 63 C0 61 4C 64 4E 81 10 A7 64 28 2A 44 88 79  NcÀaLdN.§d(*Dˆy
000000D0  91 88 00 62 49 E3 54 4A 86 9C 2C 0D 29 87 A5 98  ‘ˆ.bIãTJ†œ,.)‡¥˜
000000E0  72 2A 14 29 48 01 A4 14 4B 7C 56 D5 83 6D 7D E3  r*.)H.¤.K|VՃm}ã
000000F0  CE 9E 16 3F D1 9B 01 C4 FA 15 77 B6 CD B6 5E 28  Ξ.?ћ.Äú.w¶Í¶^(
00000100  30 B8 1E CA 2F FB D7 79 B4 7E F3 D5 97 58 CE DB  0¸.Ê/û×y´~ó՗XÎÛ
00000110  4D 3C 62 6A EE 80 0D 5A B5 72 B6 08 05 CB 4E D0  M<bjî€.Zµr¶..ËNÐ
00000120  AE 15 76 CE AE 43 C4 9E 36 78 B2 B0 54 3C 06 CE  ®.vήCĞ6x²°T<.Î
00000130  43 28 07 BF 6C 37 18 3D AF 44 CD 27 2E F3 CB 62  C(.¿l7.=¯DÍ'.óËb
00000140  59 B9 4A 1B 46 AB 7B AB D0 C6 DE 92 39 EB FD F3  Y¹J.F«{«ÐÆÞ’9ëýó
00000150  6D 32 0D B6 6B DB DB C0 C9 01 DE A9 A2 03 9A 68  m2.¶kÛÛÀÉ.Þ©¢.šh
00000160  3A A9 B7 04 05 63 3B 9B 43 04 24 7F 19 38 41 6E  :©·..c;›C.$.8An
00000170  55 B0 8D C3 D0 F2 1E EE 85 68 32 5F 84 A3 C7 31  U°ÃÐò.î…h2_„£Ç1
00000180  25 EA 61 A5 A0 72 0E 85 73 F3 EB 56 FC E7 77 F9  %êa¥ r.…sóëVüçwù
00000190  8F BF FF 05 78 A1 D2 73 0D 02 00 00 00           ¿ÿ.x¡Òs.....

Code: Select all

00000000  1F 8B 08 00 00 00 00 00 00 03 5D 51 5D 77 9A 40  .‹........]Q]wš@
00000010  10 FD 2B 26 0F 49 7C 40 3E 16 56 69 80 1C 04 6D  .ý+&.I|@>.Vi€..m
00000020  11 A2 50 B0 8A 2F 3D BB EC 12 D0 B8 28 A0 40 7B  .¢P°Š/=»ì.и( @{
00000030  FA DF DB 24 9E D6 D3 97 99 B9 73 CF CC BD 67 46  úßÛ$žÖӗ™¹sÏ̽gF
00000040  BB B1 17 56 14 FB 93 5E 56 EF 5F 0D ED 12 29 22  »±.V.û“^Vï_.í.)"
00000050  86 C6 7F 24 5C 90 CE D0 AA A4 CC 0F B5 D1 E4 8C  †Æ$\ÎЪ¤Ì.µÑäŒ
00000060  14 CD C0 F5 43 DB D5 7F FE 7A 7C 2F 06 AC 68 F4  .ÍÀõCÛÕþz|/.¬hô
00000070  BA 3B D0 22 ED 1D 68 99 16 E5 1E B1 84 DE E8 FA  º;Ð"í.h™.å.±„Þèú
00000080  FD 89 11 9A E6 8C 92 FB BB BB 2B EA 6D E2 E9 3F  ý‰.šæŒ’û»»+êmâé?
00000090  3C C0 7F 76 3F 5C 35 FB 9F 6C 54 5F 51 6F A8 7F  <Àv?\5ûŸlT_Qo¨
000000A0  51 AC 6A 54 D6 FA 5F F5 87 FE A3 C6 5F 2C 5E AC  Q¬jTÖú_õ‡þ£Æ_,^¬
000000B0  F6 AA 32 D1 6F 79 51 56 A9 AA 88 80 13 C4 14 71  öª2ÑoyQV©ªˆ€.Ä.q
000000C0  72 8A 05 0E 21 22 73 0A 84 28 25 43 49 21 22 E6  rŠ..!"s.„(%CI!"æ
000000D0  25 22 09 10 01 95 1B 51 32 E4 64 30 A4 1C 02 98  %"...•.Q2äd0¤..˜
000000E0  72 23 28 51 21 15 20 A5 08 F0 F9 A1 1A 6C AB 27  r#(Q!. ¥.ðù¡.l«'
000000F0  D7 FF BE 74 EC 1F BA 20 C6 47 73 D5 A9 D6 0A 81  ×ÿ¾tì.º ÆGsÕ©Ö.
00000100  BD B2 2A BB AD EB 2C A1 67 3E 8F 32 78 5A 37 F1  ½²*»­ë,¡g>2xZ7ñ
00000110  A8 61 D2 B2 FD DC 4E AC 2F 47 0B 7F AD 44 EF B8  ¨aÒ²ýÜN¬/G.­Dï¸
00000120  80 5B 93 65 D4 F5 F6 60 3B 3F C7 63 02 5B 3A DD  €[“eÔõö`;?Çc.[:Ý
00000130  55 F8 CC 82 6F 81 89 37 EB D3 2C AE FC 5D BC 72  UøÌ‚o‰7ëÓ,®ü]¼r
00000140  40 56 EC 64 5F 7D 16 D3 4D 34 5B B4 DE BA 48 52  @Vìd_}.ÓM4[´ÞºHR
00000150  C5 73 95 17 98 75 62 1E 9F 33 7B EA 05 9B FC 34  Ås•.˜ub.Ÿ3{ê.›ü4
00000160  AB 1C 16 46 65 31 C6 0D 08 97 8E 4B A2 20 9E CC  «..Fe1Æ..—ŽK¢ žÌ
00000170  C3 F1 8C 59 34 AC CB D7 97 78 9A 84 72 30 87 38  ÃñŒY4¬Ëחxš„r0‡8
00000180  02 D8 37 C3 70 33 6E 93 5B E3 DF AD F8 8F EF F2  .Ø7Ãp3n“[ãß­øïò
00000190  EF 7F FF 0D 86 B3 11 CC 0D 02 00 00 00           ïÿ.†³.Ì.....
infratec

It is not garbage.

If you also look at the header of the response, you can find:
HTTP/2 429
content-encoding: gzip
content-type: text/html; charset=utf-8
p3p: CP="This site does not specify a policy in the P3P header"
x-kpsdk-ct: 03YVASmMNBsbljNsAOKYtWKcBDvJY5Dzi6b8cEAAwYoCx6AvkSzGdydn9yyvylLRcDjsNs2v4DB6CjGMeCQZN56OSiTxpgCtZQZsKtLldh70BNfAjkaz6yDKYLTTWD2dpDY4fZxKkLEI0aScFZaK7C5KwzQ
content-length: 382
vary: Accept-Encoding
expires: Thu, 29 Dec 2022 09:30:28 GMT
cache-control: max-age=0, no-cache, no-store
pragma: no-cache
date: Thu, 29 Dec 2022 09:30:28 GMT
set-cookie: KP_UIDz-ssn=03YVASmMNBsbljNsAOKYtWKcBDvJY5Dzi6b8cEAAwYoCx6AvkSzGdydn9yyvylLRcDjsNs2v4DB6CjGMeCQZN56OSiTxpgCtZQZsKtLldh70BNfAjkaz6yDKYLTTWD2dpDY4fZxKkLEI0aScFZaK7C5KwzQ; Max-Age=86400; Path=/; Expires=Fri, 30 Dec 2022 09:30:28 GMT; HttpOnly; Secure; SameSite=None
set-cookie: KP_UIDz=03YVASmMNBsbljNsAOKYtWKcBDvJY5Dzi6b8cEAAwYoCx6AvkSzGdydn9yyvylLRcDjsNs2v4DB6CjGMeCQZN56OSiTxpgCtZQZsKtLldh70BNfAjkaz6yDKYLTTWD2dpDY4fZxKkLEI0aScFZaK7C5KwzQ; Max-Age=86400; Path=/; Expires=Fri, 30 Dec 2022 09:30:28 GMT; HttpOnly
set-cookie: reauid=cfb51002a17c0000345ead63a30000005cb70900; expires=Mon, 31-Dec-2038 23:59:59 GMT; path=/; domain=.realestate.com.au
set-cookie: lew_mfe=true; path=/
set-cookie: lew_mfe_rn=72; path=/
set-cookie: Country=DE; path=/; domain=.realestate.com.au
content-security-policy: upgrade-insecure-requests;
The main point is:
content-encoding: gzip
So you have to rename the downloaded file to *.gz.
Then you can open it with a file manager that can handle gz.

You will find:
<!DOCTYPE html><html><head></head><body><script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz=03XtUXRScEW86YxzZCqKbfksTfDbaeYFpAXuqzVy0hCxzcTnUmvFlJaMt9NOxPaFg0lRA0E07wCp9wjKe9XMIJOIGMkLtVVbyFk9PKTj8JuSorIZurwxOFK1Aucx9v2JOe7s1IdAnvjuINlLcKNftUOdSY5"></script></body></html>
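
Both hex dumps earlier in the thread start with 1F 8B, which is the gzip magic number, so a saved file can be recognised and unpacked even when the response headers are not available. A minimal sketch in Python rather than PureBasic; the helper name maybe_gunzip is made up for illustration:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Return the unpacked payload if data looks like gzip, else data unchanged."""
    # maybe_gunzip is a hypothetical helper, not part of any library.
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


if __name__ == "__main__":
    html = b"<!DOCTYPE html><html><head></head><body></body></html>"
    packed = gzip.compress(html)
    print(maybe_gunzip(packed) == html)  # compressed input is unpacked
    print(maybe_gunzip(html) == html)    # plain input passes through
```

Sniffing the first two bytes is only a heuristic. If the headers are available, the content-encoding value is the reliable signal.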
NicTheQuick

Sounds like a bug to me. ReceiveHTTPFile() should unzip the content automatically.
BarryG

infratec wrote: Thu Dec 29, 2022 10:36 am
So you have to rename the downloaded file to *.gz
Then you can open it with a file manager that can handle gz.

I can't ask my users to do that. I'll have to use the WebGadget() method instead.
infratec

No, it is not a bug.

It downloads the file as requested.
What you do with the file is your decision.

If you download a jpg file, what should PB do?
You need an image viewer to show the content.

If you download a zip file, you need a packer to show the content.

Same thing for gzip files.

Definitely not a bug.

But you need to know what you get.
NicTheQuick

infratec wrote: Thu Dec 29, 2022 11:37 am
No, it is not a bug.

It downloads the file as requested.
What you do with the file is your decision.

If you download a jpg file, what should PB do?
You need an image viewer to show the content.

If you download a zip file, you need a packer to show the content.

Same thing for gzip files.

Definitely not a bug.

But you need to know what you get.

It is a bug.

You are not downloading a gzip file. You are downloading a text/html file and the content is encoded using gzip. That's a simple HTTP mechanism. How should you know about that if you do not see what is written in the HTTP header? ReceiveHTTPFile() simply has to unpack the payload before writing it to disk.
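
As a rough illustration of "unpack the payload before writing it to disk", here is a minimal Python sketch (not PureBasic, and not how ReceiveHTTPFile() works internally). The function name decode_body and the plain dict of headers are assumptions made for the example:

```python
import gzip
import zlib


def decode_body(headers: dict[str, str], body: bytes) -> bytes:
    """Undo the Content-Encoding named in the response headers."""
    # decode_body is a hypothetical helper; header names are case-insensitive,
    # so they are normalised before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    encoding = lowered.get("content-encoding", "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)  # HTTP "deflate" is the zlib format
    if encoding == "identity":
        return body
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


if __name__ == "__main__":
    page = b"<!DOCTYPE html><html></html>"
    print(decode_body({"content-encoding": "gzip"}, gzip.compress(page)) == page)
    print(decode_body({}, page) == page)
```

The decision is driven by the header, not by the file name or the URL, which is why the caller needs access to the response headers at all.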
infratec

If I follow your way, how can I get the gzip content if I need it in that format for further processing?
Should I gzip it again after download?

What does curl.exe do if you use it for this page?

Code: Select all

curl --output testfile https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1
It stores the content as it is (like PB).

So curl has a bug as well.
NicTheQuick

It seems like this specific website always compresses its response regardless of what the `accept-encoding` header tells it to deliver.

curl works as expected if you add the option `--compressed`.

Code: Select all

nicolas@Rocky:~/tmp/purebasic$ curl --compressed https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1 -o -
<!DOCTYPE html><html><head></head><body><script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz=078TbFQaMt7McRkvRJEsvezkOt7upfuhnkbB5TMqKy7urbROzcke1B59mcRZugIuvilzGb4xSOjmd2v2LmKK6uSyBR6etNsfgZz399Fj6hum8OpamxbyADx3tFdAKn1x9aT07Ls80i0qexNas1mrxORiS8VmffRCkaCQnME2IkAS2Xz4QRjceCLQVSe"></script></body></html>
Even if you send "identity" for "Accept-Encoding", the server returns gzip data. So it seems the server is misconfigured.

Code: Select all

NewMap Header.s()
Header("Accept-Encoding") = "identity"
Header("User-Agent") = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

HttpRequest = HTTPRequest(#PB_HTTP_Get, "https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1", "", 0, Header())
If HttpRequest
	Debug "StatusCode: " + HTTPInfo(HTTPRequest, #PB_HTTP_StatusCode)
	Debug "Response: " + HTTPInfo(HTTPRequest, #PB_HTTP_Response)
	Debug "Headers:"
	Debug HTTPInfo(HTTPRequest, #PB_HTTP_Headers)
	
	FinishHTTP(HTTPRequest)
Else
	Debug "Request creation failed"
EndIf
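
Once the raw header text is available, the status code and content-encoding can be pulled out of it before deciding what to do with the body. A minimal Python sketch, assuming the header text looks like the dump quoted earlier in the thread (a status line followed by "name: value" lines); parse_header_block is a made-up name:

```python
def parse_header_block(raw: str) -> tuple[int, dict[str, list[str]]]:
    """Split a raw HTTP header block into a status code and a header map."""
    # parse_header_block is a hypothetical helper, not a library function.
    lines = [line for line in raw.splitlines() if line.strip()]
    # The status line looks like "HTTP/2 429" or "HTTP/1.1 200 OK".
    status = int(lines[0].split()[1])
    headers: dict[str, list[str]] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Names are case-insensitive; repeated names (set-cookie) keep every value.
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return status, headers


if __name__ == "__main__":
    sample = (
        "HTTP/2 429\r\n"
        "content-encoding: gzip\r\n"
        "content-type: text/html; charset=utf-8\r\n"
        "set-cookie: lew_mfe=true; path=/\r\n"
        "set-cookie: lew_mfe_rn=72; path=/\r\n"
    )
    status, headers = parse_header_block(sample)
    print(status, headers["content-encoding"], len(headers["set-cookie"]))
```

Checking the status is worthwhile here: the dump quoted earlier shows HTTP 429 (Too Many Requests), so the small gzip body is not the listing page itself.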
juergenkulow

How do I get the view-source of list-1 with ReceiveHTTPFile or curl?

Code: Select all

view-source:https://www.realestate.com.au/buy/in-robina,+qld+4226/list-1