Extract links from an HTML file

Posted: Sun Aug 21, 2005 12:49 pm
by bidanh00co
I have an HTML file that contains many website links. Copying and pasting each link by hand would be a foolish way to do it :P

How could we write a program that extracts all the links from an HTML file? I know a link always begins with "http://www." and ends with a double quote (").

Posted: Sun Aug 21, 2005 1:38 pm
by Killswitch

Code: Select all

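; Read the whole file (already opened as file 0) into one string
While Eof(0)=0
  String.s+ReadString()
Wend

; Print each link: from "http://" up to the quote that closes it
Repeat
  Start=FindString(String,"http://",1)
  If Start=0 : Break : EndIf
  Finish=FindString(String,Chr(34),Start)
  If Finish=0 : Break : EndIf
  Debug Mid(String,Start,Finish-Start)
  String=Mid(String,Finish+1,Len(String))
Until String=""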
Untested, but that's the general idea.

Posted: Sun Aug 21, 2005 10:42 pm
by CONVERT
Some parts of these procedures may help you. They look for image file names in .html files.

For your case, wfin$ can contain Chr(34).
wpos can contain the position in the string just after "http://www.".

Code: Select all

Procedure.s extrait(enr$,wpos,wfin$)
res$ = ""

wposg = FindString(enr$,wfin$,wpos)
If wposg = 0 Or wposg = wpos   ; terminator not found, or nothing in front of it
  wlib$ = wfile$ +  Chr(13)
  wlib$ = wlib$ + " impossible de trouver " + wfin$ + " dans extrait()" + Chr(13)
  wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
  wlib$ = wlib$ + "ERREUR FATALE."
  mess(wlib$)
  End
EndIf

res$ = Mid(enr$,wpos,wposg - wpos)

ProcedureReturn res$
EndProcedure
In the following code, which calls the previous procedure, you can replace "<img src=" + G$ + "_" with "http://www".

You can replace ".jpg" with Chr(34).

(G$ contains Chr(34).)

Code: Select all

Procedure Rech_Image(wdir_current_orig$, wfile_htm$)

wfile$ = wdir_current_orig$ + "\" + wfile_htm$

wres = ReadFile(GN_htm,wfile$)
If wres = 0
  Mess(wfile$ + " NON OUVERT dans Rech_Image(). Erreur fatale.")
  End
EndIf

While Eof(GN_htm) = 0
  enr$ = ReadString()
  wpos = FindString(enr$,"<img src=" + G$ + "_",1)
  If wpos <> 0
    wnom_numero$ = extrait(enr$,wpos + 10,".jpg")    ; ----------------   IMAGES
                                                     ; vignettes 00001_v.jpg non traités (thumbnails 00001_v.jpg are not handled)
                                                     ; cas de contact1.html général (general contact1.html case)
    If wnom_numero$ = ""
      wlib$ = wfile$ +  Chr(13)
      wlib$ = wlib$ + " impossible d'extraire le nom 'numero' de l'image" + Chr(13)
      wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
      wlib$ = wlib$ + "ERREUR FATALE."
      mess(wlib$)
      End
    Else
      wpos2 = FindString(enr$,"alt=" + G$, wpos + 14)
      If wpos2 <> 0
        wnom_definitif$ = extrait(enr$,wpos2 +5,G$)
        If wnom_definitif$ = ""
          wlib$ = wfile$ +  Chr(13)
          wlib$ = wlib$ + "Pour '<img src=' : " + wnom_numero$ + " à partir de " + Str(wpos + 14) + Chr(13)
          wlib$ = wlib$ + " impossible d'extraire le nom 'alt=' de l'image" + Chr(13)
          wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
          wlib$ = wlib$ + "ERREUR FATALE."
          mess(wlib$)
          End
        Else
          If GInum > GInum_maxi
            wlib$ = wfile$ +  Chr(13)
            wlib$ = wlib$ + "Nombre d'images supérieur à " + Str(GInum_maxi) + Chr(13)
            wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
            wlib$ = wlib$ + "ERREUR FATALE."
            mess(wlib$)
            End
          EndIf
          
          Tnum$(GInum) = wnom_numero$
          Tnom$(GInum) = wnom_definitif$
          GInum = GInum + 1
        EndIf
      EndIf
    EndIf
  EndIf
Wend

CloseFile(GN_htm)

ProcedureReturn
EndProcedure

Posted: Mon Aug 22, 2005 3:26 am
by bidanh00co
Thanks a lot, CONVERT and Killswitch!

I tried CONVERT's code first... it's a little difficult (your comments are not in English :)

When I tested it, the compiler reported the error "mess() is not a function, an array, or linked list".

The error comes from the line "mess(wlib$)".

Sorry for my stupid question; I really do not understand your code ... :(

- I can follow Killswitch's idea, though. :)
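
The error appears because mess() is not a PureBasic command. It looks like a helper from CONVERT's own program that was left out of the post. A minimal stand-in is sketched below, assuming mess() only displays its text in a message box (that behaviour is a guess, since the original helper was never shown). It has to be placed above the procedures that call it.

```purebasic
; Hypothetical replacement for the missing mess() helper.
; Assumption: the original simply displayed the text it was given.
Procedure mess(text$)
  MessageRequester("Error", text$, 0)
EndProcedure
```

The posted procedures also rely on other names that are not defined in the post (G$, GN_htm, GInum, GInum_maxi and the Tnum$() and Tnom$() arrays), so this stand-in alone is not enough to make them compile.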

Posted: Mon Aug 22, 2005 4:05 am
by dracflamloc
Um, if you have the HTML source code, just look for "<a href=" and grab everything up to "</a>".

This will give you every link.

Posted: Mon Aug 22, 2005 2:31 pm
by CONVERT
Try this code, which is closer to what you need:

Code: Select all

Procedure.s extract(enr$,wpos,wfin$) 
res$ = "" 

wposg = FindString(enr$,wfin$,wpos) 
If wposg <> wpos And wposg <> 0
  res$ = Mid(enr$,wpos,wposg - wpos) 
EndIf

ProcedureReturn res$ 
EndProcedure 


Procedure Look_url(wdir_current_orig$, wfile_htm$,wno_out) 

wfile$ = wdir_current_orig$ + "\" + wfile_htm$ 

wno_in = ReadFile(#PB_Any,wfile$) 
If wno_in = 0 
  MessageRequester("Error",wfile$ + " not opened.",0) 
  End 
EndIf 

While Eof(wno_in) = 0 
  enr$ = ReadString()
  wpos = FindString(enr$,"http://",1) 
  If wpos <> 0 
    wurl$ = extract(enr$,wpos + 7,Chr(34)) 
    If wurl$ <> "" 
      UseFile(wno_out)
      WriteStringN(wurl$)
      UseFile(wno_in)
    EndIf
  EndIf
Wend 

CloseFile(wno_in) 

ProcedureReturn 
EndProcedure 

;- BEGIN

infile$ = "your file.html"
current_dir$ = "your directory"

wno_out = CreateFile(#PB_Any,"out.txt")
If wno_out = 0
  MessageRequester("Error","out.txt not created.",0)
  End
EndIf

Look_url(current_dir$, infile$,wno_out) 

CloseFile(wno_out)

End
The result is written to the file out.txt in the current directory.
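
For readers on PureBasic 4.0 or later, where the file commands take the file number as their first argument and UseFile() is no longer available, the same idea can be sketched as below. This is an illustration under stated assumptions rather than CONVERT's code: the variable names row$, startPos and endPos are made up for the example, the file names are the placeholders from the post, and the loop collects every link on a line instead of only the first one. Unlike the code above, it keeps the "http://" prefix in the output.

```purebasic
; Sketch only: newer PureBasic file syntax, several links per line.
inFile  = ReadFile(#PB_Any, "your file.html")   ; placeholder name from the post
outFile = CreateFile(#PB_Any, "out.txt")

If inFile And outFile
  While Eof(inFile) = 0
    row$ = ReadString(inFile)
    startPos = FindString(row$, "http://", 1)
    While startPos
      endPos = FindString(row$, Chr(34), startPos)   ; closing double quote
      If endPos = 0 : Break : EndIf                  ; no closing quote on this line
      WriteStringN(outFile, Mid(row$, startPos, endPos - startPos))
      startPos = FindString(row$, "http://", endPos) ; look for the next link
    Wend
  Wend
EndIf

If inFile  : CloseFile(inFile)  : EndIf
If outFile : CloseFile(outFile) : EndIf
```

A link that is not closed by a double quote on the same line is skipped, which matches the assumption made in the first post.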

Posted: Mon Aug 22, 2005 3:25 pm
by Dare2
Assuming you are looking at an HTML source:

Code: Select all

html.s="Say this is where all your links are embedded."
html+" For example <a href="+Chr(34)+"apage.html"+Chr(34)+">Click text</a>"
html+" and sans quotes <A HREF=http://www.google.com>GOOGLE</A>"
html+" and perhaps <a href="+Chr(34)+"javascript:doWeReallyWantThis(var);"+Chr(34)+">JS Call</a>"

Debug html

; This loop pulls all hyperlinks. Handle result in "found" as needed.

p=1
Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0

Posted: Mon Aug 22, 2005 7:16 pm
by ricardo
Dare2 wrote:

Code: Select all

Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0
I think this is faster, right?

Code: Select all

Repeat
  p=FindString(UCase(html),"<A HREF=",p)
  If p
    p+9
    e=FindString(html,CHR(34),p+1)
    found.s=Mid(html,p,e-p)
    Debug found
    p=e
  EndIf
Until p=0

Posted: Tue Aug 23, 2005 1:09 am
by Dare2
Yes. :) And also avoids <A NAME= situations.

And the p=e can go as well.

(I must have released a beta!)
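
One difference between the two loops is worth noting: the "<A HREF=" version expects the link to be wrapped in double quotes, so the unquoted Google link from the test string is not handled. A possible way to cover both cases is sketched below. It is an illustration only, not code from either poster; the variable name upper is made up, and the sketch assumes the tag is written exactly as "<a href=" with nothing between "a" and "href".

```purebasic
; Sketch: handle quoted and unquoted href values.
html.s = "<a href=" + Chr(34) + "apage.html" + Chr(34) + ">Click text</a>"
html + " <A HREF=http://www.google.com>GOOGLE</A>"

upper.s = UCase(html)   ; upper-case copy made once, used only for searching
p = 1
Repeat
  p = FindString(upper, "<A HREF=", p)
  If p
    p + 8                           ; first character after "<A HREF="
    If Mid(html, p, 1) = Chr(34)    ; quoted value: read up to the closing quote
      p + 1
      e = FindString(html, Chr(34), p)
    Else                            ; unquoted value: read up to the closing ">"
      e = FindString(html, ">", p)
    EndIf
    If e = 0 : Break : EndIf        ; malformed tag, stop searching
    Debug Mid(html, p, e - p)
    p = e
  EndIf
Until p = 0
```

An unquoted link followed by further attributes (for example a target attribute) would come back with those attributes attached, so this is only a starting point.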

Posted: Wed Aug 24, 2005 6:15 am
by bidanh00co
Thanks a lot, CONVERT, Dare2 and ricardo!

I tested them all and they work very well. That's cool!

But what if I only want to extract the links that end with a file extension such as *.MOV, *.AVI, *.WMA...? How can I modify these code ??

Posted: Wed Aug 24, 2005 8:45 am
by rsts
bidanh00co wrote: How can I modify these code ??

1st - Do you know anything about programming?

:)

Posted: Wed Aug 24, 2005 11:02 am
by bidanh00co
Hi rsts !

I'm just learning programming... and I'm really bad at it :P

Of course, my question is: "How can this code be modified so that it extracts only the links that end with a particular extension: MOV, AVI, WMV...?"

You know, I have many ideas, but sometimes I cannot find the way, so I post here, even though I can't fully understand all the code you provide.

Posted: Wed Aug 24, 2005 11:14 am
by Dare2
:)

How about using Right() to check the last 4 characters of an extracted link to see if it is ".AVI"?

Posted: Wed Aug 24, 2005 2:26 pm
by CONVERT
Add this code at the end of my extract procedure, just before ProcedureReturn:

Code: Select all

if right(res$,4) <> ".xxx"
  res$ = ""
endif

Posted: Wed Aug 24, 2005 2:29 pm
by CONVERT
Or better:

Code: Select all

If Ucase(Right(res$,4)) <> ".XXX"
  res$ = ""
Endif
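
To accept several extensions at once (the MOV, AVI, WMA and WMV mentioned earlier in the thread), the same Right() and UCase() test can be wrapped in a small helper. This is a sketch under assumptions: the procedure name HasMediaExtension and the example.com addresses are invented for the illustration, and it only works for three-letter extensions because it looks at the last four characters.

```purebasic
; Hypothetical helper: returns #True when url$ ends with a wanted extension.
Procedure HasMediaExtension(url$)
  Protected ext$
  ext$ = UCase(Right(url$, 4))   ; last four characters, e.g. ".AVI"
  If ext$ = ".MOV" Or ext$ = ".AVI" Or ext$ = ".WMA" Or ext$ = ".WMV"
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug HasMediaExtension("http://www.example.com/clip.avi")    ; wanted extension
Debug HasMediaExtension("http://www.example.com/index.html")  ; not wanted
```

Inside the extract procedure the check would then read: If HasMediaExtension(res$) = #False, set res$ to "".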