Parsing a web page using the WinInet API
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Justin.
I know how to download a file (or web page) and write it to disk using WinInet; the file is copied to a buffer. How can I parse it and search for some links? There are no functions to parse a buffer, so how do I copy the buffer to a string and use the string functions (without writing the file to disk)?
Here is a snippet of my code:
tophandle=InternetOpen_("Microsoft Internet Explorer",#INTERNET_OPEN_TYPE_DIRECT,"","",0)
connecthandle=InternetConnect_(tophandle,host,#INTERNET_DEFAULT_HTTP_PORT,"","",#INTERNET_SERVICE_HTTP,0,0)
openreqhandle=HttpOpenRequest_(connecthandle,"GET",geturl,"","",0,#INTERNET_FLAG_NO_CACHE_WRITE,0)
sendreq=HttpSendRequest_(openreqhandle,"",0,"",0)
buf=AllocateMemory(1,100000,0) ;stores the web page
buf2=AllocateMemory(2,4,0) ;stores the bytes read
r=InternetReadFile_(openreqhandle,buf,100000,buf2)
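A possible sketch of the buffer-to-string step (untested; it assumes the page is plain text with no null bytes and that fewer bytes than the 100000-byte buffer were actually read) would continue the snippet above like this:
bytesread.l=PeekL(buf2)   ;number of bytes InternetReadFile_ wrote
PokeB(buf+bytesread,0)    ;null-terminate the data (only safe if bytesread < buffer size)
page$=PeekS(buf)          ;copy the buffer into a string
Debug Len(page$)          ;the normal string functions (FindString etc.) now apply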
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by PB.
I'm sure someone else will reply with an answer, but looking at your code I can't
help wondering if it would work on systems where Internet Explorer is NOT installed?
Also, see this tip: viewtopic.php?t=628
PB - Registered PureBasic Coder
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Justin.
Thank you, but that function copies the file to disk; I'm looking for a way to parse a web page in memory. Is that possible in PB?
> I can't help wondering if it would work on systems where Internet Explorer is NOT installed?
I think WinInet is independent of Internet Explorer; the first parameter of the InternetOpen function is just the name of the application or entity calling the Internet functions, so you can put anything there.
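For example (the agent name here is made up; any text works as the first parameter):
tophandle=InternetOpen_("MyOwnPageParser",1,0,0,0) ;1 = INTERNET_OPEN_TYPE_DIRECT, the agent string is only a label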
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by PB.
> i'm looking for a way of parsing a web page in memory. Is that possible in pb?
Not that I know of. This is one area where the Amiga reigns over the PC: it has
a RAM disk that URLDownloadToFile could be used quite nicely with.
Is there any reason why you couldn't download the file to the Windows temp folder
and then parse it from there? That would work even if your app was running from
a CD-ROM or other read-only media... and you can just delete the temp file after
the parsing is done. The user wouldn't even know of the file's existence. The
only potential problem would be to ensure that enough disk space is free to hold
the saved file.
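If you go the temp-folder route, a rough sketch (untested, using only the plain Win32 calls GetTempPath_, GetTempFileName_ and DeleteFile_; the download step itself is only indicated by a comment) could look like this:
tempdir$=Space(260)                            ;MAX_PATH buffer for the temp folder
GetTempPath_(260,@tempdir$)
tempfile$=Space(260)
GetTempFileName_(@tempdir$,"www",0,@tempfile$) ;unique file name in the temp folder
;... download the page into tempfile$ here (WinInet loop or URLDownloadToFile) ...
;... open tempfile$ and parse it for the links ...
DeleteFile_(@tempfile$)                        ;remove it again so the user never sees it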
PB - Registered PureBasic Coder
Edited by - PB on 06 June 2002 11:05:47
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Justin.
It is a speed issue; writing to disk is slower.
I browsed the forum and found the Space function. This works:
DefType.l tophandle,connecthandle,openreqhandle,sendreq,r
DefType.s host,geturl,strread
host="yahoo.com"
geturl="/"
INTERNET_OPEN_TYPE_DIRECT=1
INTERNET_DEFAULT_HTTP_PORT=80
INTERNET_SERVICE_HTTP=3
INTERNET_FLAG_NO_CACHE_WRITE=67108864
;initialize functions
tophandle=InternetOpen_("Microsoft Internet Explorer",INTERNET_OPEN_TYPE_DIRECT,"","",0)
connecthandle=InternetConnect_(tophandle,host,INTERNET_DEFAULT_HTTP_PORT,"","",INTERNET_SERVICE_HTTP,0,0)
openreqhandle=HttpOpenRequest_(connecthandle,"GET",geturl,"","",0,INTERNET_FLAG_NO_CACHE_WRITE,0)
sendreq=HttpSendRequest_(openreqhandle,"",0,"",0)
buf$=Space(1023) ;stores 1kb of data
buf2.l ;stores bytes read
;read web page into a string
Repeat
InternetReadFile_(openreqhandle,@buf$,1023,@buf2)
MessageBox_(0,buf$,"Contents:",#MB_OK) ;display the block just read
strread=strread+buf$ ;stores all the data
Until buf2=0
There is a problem here: the last block of data is copied twice into the string. You can write the strread string to a file to check it. Do you know why this happens?
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Pupil.
> there is a problem here, the last block of data is copied twice into the string ... do you know why this happens?
Exchange your last lines with these and you'll avoid duplicates of your last read data:
;read web page into a string
Repeat
InternetReadFile_(openreqhandle,@buf$,1023,@buf2)
messagebox_(0,buf$,"Contents:",#MB_OK) ;display 1Kb
If buf2
strread=strread+buf$ ;stores all the data
EndIf
Until buf2=0
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Justin.
It does not work; same problem.
I think what was happening is that buf$ was not cleared after each API call, so if the last block of data read is smaller than 1023 bytes (which it usually is), that data gets mixed with the previous block.
This works:
Repeat
InternetReadFile_(openreqhandle,@buf$,1023,@buf2)
;MessageBox_(0,buf$,"Contents:",#MB_OK)
strread=strread+buf$
buf$=Space(1023)
Until buf2=0
strread=StripTrail(strread)
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Pupil.
This may seem to work, but you are actually adding a lot of spaces whenever the amount of data read isn't 1023 bytes. For instance, on my computer I downloaded an HTML page and would have got 368 spaces in the middle of the HTML code, and that is not good! I would like to suggest an alternative method that retrieves and stores the data correctly:
*ptr = AllocateMemory(0, 1024, 0) ;stores 1kb of data
buf2.l ;stores bytes read
;read web page into a string
Repeat
InternetReadFile_(openreqhandle,*ptr,1023,@buf2)
If buf2
PokeB(*ptr+buf2,0)
buf$ = PeekS(*ptr)
strread=strread+buf$ ;stores all the data
EndIf
Until buf2=0
FreeMemory(0)
You could probably do this with the 'Space' command as well; it would look something like this (I haven't actually tested it, as I haven't installed the library with the Space command):
buf$ = Space(1024) ;stores 1kb of data
buf2.l ;stores bytes read
;read web page into a string
Repeat
InternetReadFile_(openreqhandle,@buf$,1023,@buf2)
If buf2
PokeB(@buf$+buf2,0)
strread=strread+buf$ ;stores all the data
EndIf
Until buf2=0
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Justin.
I don't understand your code very well. You poke a null into the buffer to act as a marker for the PeekS function, is that it?
What happens if the data read already contains nulls? PeekS will fail.
How about this? It assumes that bytes read = characters read.
I downloaded a web page and then compared it with the real page in UltraEdit, and there were no differences:
buf$=Space(1024)
bread.l
totalread.l
totalread=0
Repeat
InternetReadFile_(openreqhandle,@buf$,1024,@bread)
totalread=totalread+bread
;MessageBox_(0,buf$,"Contents:",#MB_OK)
strread=Left(strread+buf$,totalread)
buf$=Space(1024)
Until bread=0
strread=StripTrail(strread)
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm
Restored from previous forum. Originally posted by Pupil.
> You poke a null into the buffer to act as a marker for the PeekS function, is that it?
> What happens if the data read already contains nulls? PeekS will fail.
The reason I insert a null character is that every string ends with a null character; it is used as a marker that tells where the string ends. If you want to handle data that contains null characters, you should store it in memory allocated with the memory commands, not in a string. As HTML pages contain only plain text, they have no null characters in them, so it is fine in this case.
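A tiny illustration of what the terminating null does (the values are made up; memory bank 3 is just an id not used elsewhere in this thread):
*demo=AllocateMemory(3,16,0)
PokeS(*demo,"ABCDEFGH")   ;write eight characters into the buffer
PokeB(*demo+4,0)          ;poke a null after the fourth character
Debug PeekS(*demo)        ;shows "ABCD" - PeekS stops at the first null byte
FreeMemory(3)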
> How about this? It assumes that bytes read = characters read.
> I downloaded a web page and then compared it with the real page in UltraEdit, and there were no differences:
Yes, the code looks OK, but consider loading a huge HTML page and the amount of data that gets passed to the 'Left' command. If you have a 1 MB page (unlikely, but just to prove a point) you will do 1024 loops and as many Left operations over data that grows with every iteration; that is a huge waste of computer power, don't you agree? So I'll suggest this simple improvement:
buf$=Space(1024)
bread.l
totalread.l
totalread=0
Repeat
InternetReadFile_(openreqhandle,@buf$,1024,@bread)
strread=strread+Left(buf$,bread) ;append only the bytes actually read
buf$=Space(1024)
Until bread=0
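With the page collected in strread like that, the link search from the original question can be done with the normal string functions. A naive, untested sketch using FindString might look like this (it only handles href="..." with double quotes, nothing else):
pos.l=FindString(strread,"href="+Chr(34),1)
While pos
endpos.l=FindString(strread,Chr(34),pos+6)  ;closing quote of the URL
If endpos
Debug Mid(strread,pos+6,endpos-(pos+6))     ;the text between the quotes
EndIf
pos=FindString(strread,"href="+Chr(34),pos+1)
Wend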
-
BackupUser
- PureBasic Guru

- Posts: 16777133
- Joined: Tue Apr 22, 2003 7:42 pm