Programmatically saving a webpage as MHT

Everything else that doesn't fall into one of the other PB categories.
Seymour Clufley
Addict
Addict
Posts: 1267
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Programmatically saving a webpage as MHT

Post by Seymour Clufley »

For a brief distraction, I'm thinking of attempting to write a program that would save a webpage as an MHT file. Hopefully the result would be a procedure as simple to use as this:

Code: Select all

Procedure.b SaveToMHT(url.s,filepath.s)
I know it can be done using Windows hooks and so forth but I think it might be fun to attempt a wholly PB solution using the Base64 encoder.

1. Get URL source code
2. Parse for linked resources (CSS, JS files etc.) and download them
3. Base64 the HTML and resources
4. Put all the Base64 data blocks together in a file with appropriate header tags.

The problem I foresee is this: what if a page depends on Ajax? Say for example, an onLoad event is to ask the server for the page's content?

In this case, the MHT would have none of the content you'd see if you were viewing the page live online. It would just be the bare bones of the page before it contacted the server.

Perhaps a workaround would be to load the page in the webgadget, wait until it's finished loading, then somehow grab the actual HTML inside the gadget (not the HTML source). I have done this on my own pages using a JavaScript function, so presumably one could add that JS function into any webpage.

Or is there another way?

Seymour.
Seymour Clufley
Addict
Addict
Posts: 1267
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Post by Seymour Clufley »

Well I've been beavering away on this and have come up with some code - although it is Windows-only I'm afraid.



;--------------------
; INTRODUCTION

There are two "modes". To make an MHT you need the page's HTML code. "Normal mode" will obtain it using GetGadgetItemText, which doesn't get the current HTML inside the webgadget, but the HTML most recently loaded into it. The other mode, "current", uses API functions to get the current HTML.

On my computer, the code does work but there are some snags:
  • It sometimes incorrectly redirects relative links. This can be clearly seen by missing images.
  • In "current mode", the page's DocType is still obtained using GetGadgetItemText which may not be current. (I believe it's possible for a page's DocType to be altered by JS or PHP, but may be wrong.) Please see this thread.
  • The MHT's "boundary" should theoretically be unique for every MHT. I haven't done that yet as I'm not sure what the rules are. (Prefer to get the straightforward stuff done first!) If anyone knows and would like to write the FormBoundary() procedure, feel free!
  • Pages from this forum seem to be saved quite successfully, but a test from The Times website yielded less satisfying results.
  • There are occasional glitches in the HTML, possibly due to incorrect encoding to Quoted Printable. Snag#1 may be a result of this. Text sizing particularly suffers.
  • It's slow, especially at encoding bitmaps to Base64.
In addition, page resources are downloaded to file then read into memory. This isn't a problem but it is a shortcoming. I haven't yet studied Thalius's download-to-memory code in T&T but will hopefully implement it in the near future.

So the code needs some work. It'll be a while before it's finished but in the meantime I thought it'd be good to post this version in case there's anything in it people can use.

For one thing it includes cross-platform procedures to encode to the Quoted Printable format, which may be useful. (This particular task utilises several procedures and a structure of my own liking - doubtless it could all be condensed into one clean procedure.) However, there may be a problem with this code - see snag#5.

On a personal note, another reason I want to submit this is because for 18 months I've asked many questions on this forum and received a lot of kind assistance from the brethren. I'm glad to be able to give something back, even if it is preliminary code.

I apologise for the unwieldiness of the code. It uses procedures that I keep categorised in many separate PBI's - clearly for this the simplest thing was to lump them all together.



;--------------------
; GOALS
  • Clean, readable code in one include file
  • Procedures
    • URL2MHT(url.s,mhtfile.s,mode.b,express.b)
    • Webgadget2MHT(gad.i,mhtfile.s,mode.b,express.b)
  • Modes
    • NORMAL - use the HTML loaded into the webgadget
    • CURRENT - use the HTML currently inside the webgadget
    • FREEZE - eliminate all server requests to ensure full offline functionality
  • "Express" option, would convert all images to JPEG for faster encoding
  • Possibly cross-platform.


;--------------------
; PRESENT CODE

Just now there is one procedure: URL2MHT(url.s,file.s,mode.b)
  • It will create a folder "C:\ClufleyMHT\". (You can change this to whatever you want.)
  • Inside that it will create a working folder unique to each MHT. This is where resources will be downloaded to.
  • The working folder is deleted at the end of the procedure, leaving just the "C:\ClufleyMHT\" folder.
If you specify no filename, the MHT will be saved inside the ClufleyMHT folder with a filename derived from the webpage's title, as per IE.

An example usage may be:

Code: Select all

url.s = "http://www.purebasic.fr/english/index.php"
file.s = "C:\pb forum index.mht"
status = URL2MHT(url,file,#MHT_Current)
Have fun and please read the section following the code!

Code: Select all

Procedure R(str.s)
  MessageRequester("URL2MHT",str,0)
EndProcedure

Procedure RStatus(str.s,status.b)
  
  b.s
  If status
      b = "True"
  Else
      b = "False"
  EndIf
  
  R(str+" = "+b)
  
EndProcedure


Global c9.s = Chr(9)
Global c10.s = Chr(10)
Global c13.s = Chr(13)
Global c32.s = Chr(32)
Global c34.s = Chr(34)
Global c39.s = Chr(39)

Global mainfolder.s = "C:\ClufleyMHT\"

Enumeration ; modes
  #MHT_Normal
  #MHT_Current
EndEnumeration





;-
;- MATH PROCEDURES


Procedure.d Beat(a.d,b.d)
  
  If a>b
      ProcedureReturn a
  Else
      ProcedureReturn b
  EndIf
  
EndProcedure


Procedure.d Defeat(a.d,b.d)
  
  If a<b
      ProcedureReturn a
  Else
      ProcedureReturn b
  EndIf
  
EndProcedure







;-
;- STRING PROCEDURES

Procedure.s EnsureStart(string.s,start.s)
  
  If Left(string,Len(start)) <> start
      string = start+string
  EndIf
  
  ProcedureReturn string
  
EndProcedure

Procedure.s EnsureEnd(string.s,endd.s)
  
  If Right(string,Len(endd)) <> endd
      string = string+endd
  EndIf
  
  ProcedureReturn string
  
EndProcedure

Procedure.s EnsureNotStart(string.s,start.s)
  
  If Left(string,Len(start)) = start
      string = Mid(string,Len(start)+1,Len(string))
  EndIf
  
  ProcedureReturn string
  
EndProcedure

Procedure.s EnsureNotEnd(string.s,endd.s)
  
  snipped = Len(string)-Len(endd)
  
  If Right(string,Len(endd)) = endd
      string = Left(string,snipped)
  EndIf
  
  ProcedureReturn string
  
EndProcedure




Procedure.s GetBefore(string.s,split.s)
  
  test.l = FindString(string,split,0)
  If test
      string = Left(string,test-1)
  EndIf
  
  ProcedureReturn string
  
EndProcedure

Procedure.s GetAfter(string.s,split.s)
  
  test.l = FindString(string,split,0)
  If test
      ProcedureReturn Right(string,Len(string)-test-Len(split)+1)
  EndIf
  
  ProcedureReturn ""
  
EndProcedure




Procedure.s EnsureNoDuplicates(arr.s,delim.s)
  
  items = CountString(arr,delim)
  
  If Not items
      ProcedureReturn arr
  EndIf
  
  clean.s=delim
  For a = 1 To items
      code.s = StringField(arr,a,delim)
      If Not FindString(clean,delim+code+delim,0)
          clean+code+delim
      EndIf
  Next a
  
  ProcedureReturn EnsureNotStart(clean,delim)
  
EndProcedure



Procedure.s RecursiveReplaceString(str.s,a.s,b.s)
  
  While FindString(str,a,0)
      str=ReplaceString(str,a,b)
  Wend
  
  ProcedureReturn str
  
EndProcedure




Procedure.s BSTRToString(Address.i)
  
  Protected String.s, Length.l
  Length = WideCharToMultiByte_(#CP_ACP, 0, Address, -1, 0, 0, #Null, #Null)
  String = Space(Length)
  WideCharToMultiByte_(#CP_ACP, 0, Address, -1, @String, Length, #Null, #Null)
  SysFreeString_(String)
  ProcedureReturn String
  
EndProcedure




Procedure.s FieldAccumulate(str.s,level.l,delim.s)
  
  fields = CountString(str,delim)
  accum.s
  
  For a = 1 To Defeat(level,fields)
      accum+StringField(str,a,delim)+delim
  Next a
  
  ProcedureReturn accum
  
EndProcedure




Procedure.s RemoveFieldsFromEnd(arr.s,delim.s,postremove.l)
  
  fields=CountString(arr,delim)
  
  take=fields-postremove
  
  If take>0 And take<fields
      r.s
      For f = 1 To take
          r+StringField(arr,f,delim)+delim
      Next f
      ProcedureReturn r
  Else
      ProcedureReturn arr
  EndIf
  
EndProcedure







;-
;- WEBGADGET/MS PROCEDURES

DataSection
  IID_IHTMLDocument: ; {626FC520-A41E-11CF-A731-00A0C9082637}
    Data.l $626FC520
    Data.w $A41E, $11CF
    Data.b $A7, $31, $00, $A0, $C9, $08, $26, $37
  IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119}
    Data.l $332C4425
    Data.w $26CB, $11D0
    Data.b $B4, $83, $00, $C0, $4F, $D9, $01, $19
  IID_IHTMLDocument3: ; {3050F485-98B5-11CF-BB82-00AA00BDCE0B}
    Data.l $3050F485
    Data.w $98B5,$11CF
    Data.b $BB,$82,$00,$AA,$00,$BD,$CE,$0B
  IID_IHTMLDocument4: ; {3050F69A-98B5-11CF-BB82-00AA00BDCE0B}
    Data.l $3050F69A
    Data.w $98B5, $11CF
    Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B
  IID_IHTMLDocument5: ; {3050F80C-98B5-11CF-BB82-00AA00BDCE0B}
    Data.l $3050F80C
    Data.w $98B5, $11CF
    Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B
EndDataSection


Procedure GetBrowser(g.i) ; By Zapman Inspired by Fr34k
  If IsGadget(g)
    If GetGadgetText(g) = ""
      SetGadgetText(g, "about:blank") ; to avoid error when using Browser
      While WindowEvent():Wend
    EndIf
    
    Browser.IWebBrowser2 = GetWindowLong_(GadgetID(g), #GWL_USERDATA)
    If Browser
      Ready = 0
      ct = 0
      While Ready < 4 And ct<200
        WindowEvent()
        State = 0
        If Browser\get_ReadyState(@BrowserState.i) = #S_OK
          If BrowserState = 4
            Ready + 1
          EndIf
        EndIf
        If Ready = 0 : Delay(5) : EndIf
        ct + 1
      Wend
    EndIf
    ProcedureReturn Browser
  EndIf
EndProcedure


Procedure GetDocumentDispatch(g.i) ;  By Zapman Inspired by Fr34k
  ; Example: DocumentDispatch.IDispatch = GetDocumentDispatch(WebGadget)
  ; Do not forget to release DocumentDispatch when finished to use it
  Browser.IWebBrowser2 = GetBrowser(g)
  If Browser
    If Browser\get_Document(@DocumentDispatch.IDispatch) = #S_OK
      ProcedureReturn DocumentDispatch
    EndIf
  EndIf
EndProcedure


Procedure.s GetWebgadgetHTML(g.i)
  ; Retrieve all HTML content of the document which is in the Webgadget
  
  DocumentDispatch.IDispatch = GetDocumentDispatch(g)
  If DocumentDispatch
    If DocumentDispatch\QueryInterface(?IID_IHTMLDocument2, @Document.IHTMLDocument2) = #S_OK And Document
      Document\get_body(@Element.IHTMLElement); Get the <BODY> Element
      If Element
        If Element\get_parentElement(@Parent.IHTMLElement) = #S_OK And Parent; Get the <HTML> Element
          Parent\get_outerHTML(@bstr)
          Parent\Release()
        EndIf
        Element\Release()
      EndIf
      Document\Release()
    EndIf
    DocumentDispatch\Release()
  EndIf
  
  HTML.s
  If bstr
    HTML = PeekS(bstr, -1, #PB_Unicode) ; get the whole text of the document
    SysFreeString_(bstr)
  EndIf
  
  ; now we need to get the doctype.
  ; I haven't worked out how to use IHTMLDOCUMENT5, so for now I'm having 
  ; to use PB's GetGadgetItemText. In most cases, perhaps all, it will be correct!
  doctype.s = GetGadgetItemText(g,#PB_Web_HtmlCode)
  detect = FindString(LCase(doctype),"<html>",0)
  doctype = Left(doctype,detect-1)
  If FindString(LCase(doctype),"doctype",0)
      ProcedureReturn doctype+c13+HTML
  Else
      ProcedureReturn HTML
  EndIf
  
EndProcedure


Procedure.s WBGetTitle(g.i)
  
  If Not IsGadget(g) : ProcedureReturn : EndIf
  
  Browser.IWebBrowser2 = GetBrowser(g)
  title.s
  If Browser
      Browser\get_LocationName(@bstr)
      If bstr
          title = BSTRToString(bstr)
      EndIf
  EndIf
  
  ProcedureReturn title
  
EndProcedure







;-
;- FILE/FOLDER PROCEDURES

Procedure.b FolderExists(folder.s)
  
  If FileSize(folder) = -2
      ProcedureReturn #True
  EndIf
  
EndProcedure


Procedure.b EnsureFolder(folder.s)
  
  If FolderExists(folder)
      ProcedureReturn #True
  Else
      ProcedureReturn CreateDirectory(folder)
  EndIf
  
EndProcedure


Procedure.b IsDrivePath(path.s)
  If Right(path,2)=":\"
      ProcedureReturn #True
  EndIf
EndProcedure


Procedure.b EnsureFolderPath(path.s)
  
  path=EnsureEnd(path,"\")
  shortpath.s
  levels = CountString(path,"\")
  If levels>1
      For f = 1 To levels
          shortpath = FieldAccumulate(path,f,"\")
          If IsDrivePath(shortpath)
              Continue
          EndIf
          EnsureFolder(shortpath)
      Next f
  EndIf
  
  
EndProcedure



Procedure.b StringToFile(filename.s,string.s)
  
  If Not filename
      ProcedureReturn #False
  EndIf
  
  string = RemoveString(string,c10)
  If FindString(string,c13,0)
      string = EnsureNotEnd(string,c13) ; this removes final linebreak
      string = ReplaceString(string,c13,c13+c10)
  EndIf
  
  file = CreateFile(#PB_Any,filename)
  If IsFile(file)
      WriteString(file,string)
      CloseFile(file)
      ProcedureReturn #True
  EndIf
  
EndProcedure







;-
;- URL/FILENAME PROCEDURES

Procedure.b FileIsOnline(url.s)
  
  online.b
  
  test.s = Left(url,3)
  Select test
      Case "www"
          online = #True
      Case "htt"
          online = #True
      Case "ftp"
          online = #True
      Default
          online = #False
  EndSelect
  
  ProcedureReturn online
  
EndProcedure

Procedure.b WASTEURLIsRelative(url.s)
  If Not FindString(url,"\",0) And Not FileIsOnline(url)
      ProcedureReturn #True
  EndIf
EndProcedure




Procedure.s GetExt(rawfilepath.s)
  
  rawfilepath = LCase(rawfilepath)
  rawfilepath = RemoveString(rawfilepath,"*")
  
  file.s = GetBefore(rawfilepath,"#")
  
  file = ReplaceString(file,"/","\")
  file = EnsureEnd(file,"\")
  file = StringField(file,CountString(file,"\"),"\")
  
  ; ? and & can appear in URL addresses, but are legal for filenames,
  ; so we need everything up until the last instance, but not including it
  If FileIsOnline(rawfilepath)
      file = RemoveFieldsFromEnd(file+"&","&",1)
      file = EnsureNotEnd(file,"&")
      file = RemoveFieldsFromEnd(file+"?","?",1)
      file = EnsureNotEnd(file,"?")
  EndIf
  
  ; a filename can contain multiple dots
  file = EnsureEnd(file,".")
  ext.s = StringField(file,CountString(file,"."),".")
  ext = EnsureNotEnd(ext,".")
  
  ext = RemoveString(ext,"\")
  ;R("EXT: "+ext)
  
  ProcedureReturn ext
  
EndProcedure


Procedure.s ValidateOnlineFilename(url.s)
  
  url = GetBefore(url,"#")
  url = GetBefore(url,"?")
  url = GetBefore(url,c32)
  
  test.s = Left(url,3)
  Select test
      Case "www"
          url="http://"+url
  EndSelect
  
  url=EnsureNotEnd(url,"/")
  
  ProcedureReturn url
  
EndProcedure

Procedure.s ValidateFilename(rawfile.s,replacechar.s)
  
  If FileIsOnline(rawfile)
      ;R("File is online!")
      ProcedureReturn ValidateOnlineFilename(rawfile)
  EndIf
  
  
  rawfile = EnsureNotStart(rawfile,"file:///")
  
  path.s = GetPathPart(rawfile)
  path = EnsureEnd(path,"\")
  path = ReplaceString(path,"/","\") ; maybe not necessary
  
  file.s = GetFilePart(rawfile)
  file = GetBefore(file,"#")
  If FindString(file,".",0)
      file = RemoveFieldsFromEnd(file+".",".",1)
      file = EnsureNotEnd(file,".")
  EndIf
  
  illegalchars.s = "/-\-:-*-?-|-<->-"+c34+"-"
  For i = 1 To CountString(illegalchars,"-")
      file = ReplaceString(file,StringField(illegalchars,i,"-"),replacechar)
  Next i
  file = Trim(file)
  
  ext.s = GetExt(rawfile)
  
  ProcedureReturn path+file+"."+ext
  
EndProcedure


Procedure.s URL_GetPath(url.s)
  
  If FileIsOnline(url)
      url = ValidateOnlineFilename(url)
      folder.s = url
      L=Len(folder)
      If Right(folder,1) = "/"
          L-1
          folder=Left(folder,L)
      EndIf
      
      For a = 1 To L
          test$=Right(folder,a)
          test$=Left(test$,1)
          If test$="/" Or test$="\"
              b=L-a+1
              folder=Left(folder,b)
              Break
          EndIf
      Next
  Else
      folder=GetPathPart(url)
  EndIf
  
  ProcedureReturn folder
  
EndProcedure







;-
;- "LUMP" STUFF

; "lumps are a structure I use to store multi-line strings. Hopefully it isn't too esoteric!

#LumpMax=20000
#MiniLumpMax=500

Structure Lump
  lines.w
  line.s[#LumpMax]
EndStructure
Structure MiniLump
  lines.w
  line.s[#MiniLumpMax]
EndStructure

Procedure.l LumpLinesFromString(str.s)
  ProcedureReturn Defeat(#LumpMax,CountString(str,c13))
EndProcedure


Procedure.s LumpToString(*source.Lump)
  
  str.s
  For a = 1 To *source\lines
      str + *source\line[a] + c13
  Next
  
  ProcedureReturn str
  
EndProcedure


Procedure.b LumpFromString(str.s,*dest.Lump)
  
  str=RemoveString(str,c10)
  str=EnsureEnd(str,c13)
  *dest\lines = LumpLinesFromString(str)
  
  For a = 1 To *dest\lines
      *dest\line[a] = StringField(str,a,c13)
  Next a
  
  ProcedureReturn #True
  
EndProcedure


Procedure.b RebreakLump(*l.Lump)
  
  If *l\lines
      str.s = LumpToString(*l)
      LumpFromString(str,*l)
  EndIf
  
EndProcedure


Procedure.b WrapLumpToLength(*lp.Lump,wl.i)
  
  If Not *lp\lines
      ProcedureReturn #False
  EndIf
  
  cong.s
  For a = 1 To *lp\lines
      start = 1
      cong = ""
      linelength = Len(*lp\line[a])
      If linelength>wl
          Repeat
              cong+Mid(*lp\line[a],start,wl)+c13
              start+wl
          Until start>linelength
          *lp\line[a] = EnsureNotEnd(cong,c13)
      EndIf
  Next a
  
  RebreakLump(*lp)
  
  ProcedureReturn #True
  
EndProcedure







;-
;- DATE PROCEDURES

Structure dtsplit
  second.i
  minute.i
  hour.i
  day.i
  month.i
  year.i
  dayofweek.s
EndStructure


Procedure.s DayName(datelong.i)
  
  d=DayOfWeek(datelong)
  
  n.s
  Select d
      Case 0
          n="Sunday"
      Case 1
          n="Monday"
      Case 2
          n="Tuesday"
      Case 3
          n="Wednesday"
      Case 4
          n="Thursday"
      Case 5
          n="Friday"
      Case 6
          n="Saturday"
  EndSelect
  
  ProcedureReturn n
  
EndProcedure

Procedure.s MonthName(num.i)
  
  Select num
      Case 1
          ProcedureReturn "January"
      Case 2
          ProcedureReturn "February"
      Case 3
          ProcedureReturn "March"
      Case 4
          ProcedureReturn "April"
      Case 5
          ProcedureReturn "May"
      Case 6
          ProcedureReturn "June"
      Case 7
          ProcedureReturn "July"
      Case 8
          ProcedureReturn "August"
      Case 9
          ProcedureReturn "September"
      Case 10
          ProcedureReturn "October"
      Case 11
          ProcedureReturn "November"
      Case 12
          ProcedureReturn "December"
  EndSelect
  
EndProcedure


Procedure.s ExtractifyDate(datelong.i,*dt.dtsplit)
  
  *dt\second = Val(FormatDate("%ss",datelong))
  *dt\minute = Val(FormatDate("%ii",datelong))
  *dt\hour = Val(FormatDate("%hh",datelong))
  *dt\day = Val(FormatDate("%dd",datelong))
  *dt\month = Val(FormatDate("%mm",datelong))
  *dt\year = Val(FormatDate("%yyyy",datelong))
  *dt\dayofweek = DayName(datelong)
  
EndProcedure


Procedure.s MHTDateTime(datelong.i)
  
  ExtractifyDate(datelong,@dt.dtsplit)
  
  monthname.s = MonthName(dt\month)
  
  GMT.s = "-0000"
  
  md.s = 

Left(dt\dayofweek,3)+","+c32+Str(dt\day)+c32+Left(monthname,3)+c32+Str(dt\year)+c32+Str(dt\hour)+":"+Str(dt\minute)+":"+Str(dt\second)+c32+G

MT
  
  ProcedureReturn md
  
EndProcedure





















;-
;- MHT-MAKING PROCEDURES


#QPLL=76




Procedure.s MIMEType(file.s)
  
  ext.s = GetExt(file)
  ext = EnsureStart(ext,".")
  
  mime.s
  Select ext
      Case ".jpg"
          mime = "image/jpeg"
      Case ".gif"
          mime = "image/gif"
      Case ".png"
          mime = "image/png"
      Case ".bmp"
          mime = "image/bmp"
      Case ".css"
          mime = "text/css"
      Case ".html"
          mime = "text/html"
      Case ".js"
          mime = "application/octet-stream"
  EndSelect
  
  If mime
      ProcedureReturn mime
  EndIf
  
  hKey.l = 0
  datasize.l = 255
  KeyValue.s = Space(datasize)
  If RegOpenKeyEx_(#HKEY_CLASSES_ROOT, ext, 0, #KEY_READ, @hKey)
      KeyValue = "application/octet-stream"
  Else
      If RegQueryValueEx_(hKey, "Content Type", 0, 0, @KeyValue, @datasize)
          KeyValue = "application/octet-stream"
      Else
          KeyValue = Left(KeyValue, datasize-1)
      EndIf
      RegCloseKey_(hKey)
  EndIf
  
  ProcedureReturn KeyValue
  
EndProcedure


Procedure WaitForWebgadget2(gad.l,maxmilliseconds.l)
  
  Browser.IWebBrowser2=GetWindowLong_(GadgetID(gad),#GWL_USERDATA) 
  milliwait=125
  Repeat 
      While WindowEvent() : Wend : Browser\get_Busy(@busy.l) 
      count+milliwait
      If busy=#VARIANT_TRUE
          Sleep_(milliwait)
      EndIf 
      If maxmilliseconds>0
          If count=>maxmilliseconds
              Break
          EndIf
      EndIf
  Until busy=#VARIANT_FALSE
  
EndProcedure


Procedure.s FromPosToNext(str.s,pos.i,nxt.s)
  
  ne=FindString(str,nxt,pos)
  If Not ne
      ne=Len(str)
  EndIf
  
  extr.s = Mid(str,pos,ne)
  
  ProcedureReturn extr
  
EndProcedure


Procedure.s GetLinkedResourceNames(html.s)
  
  resourcelist.s
  newfile.s
  
  extnames.s = "jpg|png|bmp|gif|js|php|css|"
  exts = CountString(extnames,"|")
  
  src.s = "src="
  sources = CountString(html,src)
  If sources
      ;R("INSTANCES OF SRC = "+Str(sources))
      For s = 1 To sources
          detect=FindString(html,src,detect+1)
          If detect
              newfile = FromPosToNext(html,detect,src)
              newfile = StringField(newfile,1,c13)
              newfile = StringField(newfile,1,">")
              newfile = RemoveString(newfile,src)
              newfile = RemoveString(newfile,c34)
              newfile = StringField(newfile,1,c32)
              newfile = GetBefore(newfile,"?")
              resourcelist+newfile+"|"
          EndIf
      Next s
  EndIf
  detect=0
  
  
  bgimurl.s = "background-image:url"
  html = ReplaceString(html,"background-image: url",bgimurl)
  bgimurls = CountString(html,bgimurl)
  If bgimurls
      ;R("INSTANCES OF "+bgimurl+"  =  "+Str(bgimurls))
      For a = 1 To bgimurls
          detect=FindString(html,bgimurl,detect+1)
          If detect
              newfile = FromPosToNext(html,detect,bgimurl)
              newfile = RemoveString(newfile,bgimurl)
              newfile = StringField(newfile,1,")")
              newfile =  EnsureNotStart(newfile,"(")
              ;R(newfile)
              resourcelist+newfile+"|"
          EndIf
      Next a
  EndIf
  
  ; and now a special search for any linked stylesheets
  If FindString(LCase(html),"stylesheet",0)
      html+c13
      For a = 1 To CountString(html,c13)
          line.s = LCase(StringField(html,a,c13))
          If FindString(line,"<link rel",0)
              If FindString(line,"stylesheet",0)
                  line = GetAfter(line,"hre")
                  line = GetAfter(line,"=")
                  line = RemoveString(line,"/>")
                  line = RemoveString(line,">")
                  line = RemoveString(line,c34)
                  line = Trim(line)
                  resourcelist+line+"|"
              EndIf
          EndIf
      Next a
  EndIf
  
  resourcelist = EnsureNoDuplicates(resourcelist,"|")
  
  ProcedureReturn resourcelist
  
EndProcedure




Procedure.s RandomCode(l.i)
  
  c.s
  For a = 1 To l
      c+Str(Random(9))
  Next a
  
  ProcedureReturn c
  
EndProcedure




Procedure.b WrapLumpToLengthWithEndchar(*lp.Lump,wl.i,endchar.s)
  
  If Not *lp\lines
      ProcedureReturn #False
  EndIf
  
  If Not endchar
      endchar=c13
  EndIf
  
  cong.s
  For a = 1 To *lp\lines
      start = 0
      cong = ""
      linelength = Len(*lp\line[a])
      If linelength>wl
          Repeat
              cong+Mid(*lp\line[a],start+1,wl)+endchar
              start+wl
          Until start>linelength
          *lp\line[a] = EnsureNotEnd(cong,c13)
      EndIf
  Next a
  
  RebreakLump(*lp)
  
  ProcedureReturn #True
  
EndProcedure


Procedure.s Ascii2QP(ascii.i)
  
  h.s = Hex(ascii)
  If Len(h)=1
      h="0"+h
  EndIf
  
  ProcedureReturn "="+h
  
EndProcedure


Procedure.s QuotedPrintable(raw.s)
  
  LumpFromString(raw,@lp.Lump)
  
  c.s
  qp.s
  For l = 1 To lp\lines
      linelength = Len(lp\line[l])
      For a = 1 To linelength
          c=Mid(lp\line[l],a,1)
          asc=Asc(c)
          Select asc
              Case 10
                  
              Case 13
                  qp+c
              Case 32
                  If a=linelength
                      qp+Ascii2QP(asc)
                  Else
                      qp+c32
                  EndIf
              Case 33 To 60
                  qp+c
              Case 62 To 126
                  qp+c
              Default
                  qp+Ascii2QP(asc)
          EndSelect
      Next a
      qp+c13
  Next l
  
  LumpFromString(qp,@lp.Lump)
  WrapLumpToLengthWithEndchar(@lp,#QPLL-2,"="+c13)
  
  flattened.s = LumpToString(@lp)
  flattened = RecursiveReplaceString(flattened,c13+c13,c13)
  
  ProcedureReturn flattened
  
EndProcedure




Procedure.s EncodeToBase64(*rawdata,rawloaf.i)
  
  enclength = 1.5*rawloaf
  *enc = AllocateMemory(enclength)
  
  encodedlength = Base64Encoder(*rawdata,rawloaf,*enc,enclength)
  
  encodedstring.s = PeekS(*enc,encodedlength)
  
  FreeMemory(*enc)
  
  ProcedureReturn encodedstring
  
EndProcedure


Procedure.s EncodeToQuotedPrintable(*rawdata,rawloaf.i)
  
  rawstring.s = PeekS(*rawdata,rawloaf)
  
  encodedstring.s = QuotedPrintable(rawstring)
  
  ProcedureReturn encodedstring
  
EndProcedure




Procedure.s ContentLocation(urlfile.s)
  
  If FileIsOnline(urlfile)
      
  Else
      urlfile = "file:///"+urlfile
  EndIf
  
  ProcedureReturn "Content-Location: "+urlfile
  
EndProcedure


Procedure.s Alter(str.s) ; for testing
  ; this alters the filename of a resource within the MHT, so that we can tell
  ; whether the MHT is still accessing an online resource instead of an internal one.
  ext$ = GetExt(str)
  str = EnsureNotEnd(str,"."+ext$)
  ProcedureReturn str+"ALTER."+ext$
EndProcedure

Procedure.s FormPart(urlfile.s,folder.s,boundary.s,network.b)
  
  urlfile = ValidateFileName(urlfile,"")
  
  ext.s = GetExt(urlfile)
  encodemethod.s
  Select ext
      Case "html", "css", "js", "php"
          encodemethod="quoted-printable"
      Default
          encodemethod="base64"
  EndSelect
  
  header.s = "--"+boundary +c13
  header + "Content-Type: "+MIMEType(urlfile) +c13
  header + "Content-Transfer-Encoding: "+encodemethod +c13
  header + ContentLocation(Alter(urlfile)) +c13+c13
  
  
  localfile.s = folder+RandomCode(8)+c32+StringField(GetFilePart(urlfile),1,".")+"."+ext
  
  If FileIsOnline(urlfile)
      ;R("File is online")
      If network
          status = ReceiveHTTPFile(urlfile,localfile)
      Else
          R("Can't download because network=0.")
      EndIf
  Else
      ;R("File is offline (local)")
      status = CopyFile(urlfile,localfile)
  EndIf
  
  file = OpenFile(#PB_Any,localfile)
  If Not file
      ProcedureReturn ""
  EndIf
  loaf = Lof(file)
  
  If loaf<1
      CloseFile(file)
      ProcedureReturn ""
  EndIf
  *filedata = AllocateMemory(loaf)
  
  ReadData(file,*filedata,loaf)
  CloseFile(file)
  
  encodedstring.s
  Select encodemethod
      Case "base64"
          ;R("Encode method: Base64")
          encodedstring = EncodeToBase64(*filedata,loaf)
      Case "quoted-printable"
          ;R("Encode method: Quoted-Printable")
          encodedstring = EncodeToQuotedPrintable(*filedata,loaf)
  EndSelect
  
  FreeMemory(*filedata)
  
  LumpFromString(encodedstring,@lm.Lump)
  WrapLumpToLength(@lm,#QPLL)
  encodedstring = LumpToString(@lm)
  
  ProcedureReturn header+encodedstring+c13+c13+c13
  
EndProcedure




Procedure.s GetWebgadgetURL(wg.i)
  
  If IsGadget(wg)
      WebObject.IWebBrowser2 = GetWindowLong_(GadgetID(wg), #GWL_USERDATA)
      WebObject\Get_LocationURL(@BSTR_Url.l)
      If BSTR_Url
          ProcedureReturn BSTRToString(BSTR_Url)
      EndIf
  EndIf
  
EndProcedure


Procedure.s GetWebpageBaseDirectory(wg.i)
  
  realurl.s = GetWebgadgetURL(wg)
  
  If realurl
      baseurl.s = URL_GetPath(realurl)
  EndIf
  
  html.s = LCase(GetGadgetItemText(wg,#PB_Web_HtmlCode))
  If FindString(html,"<base",0)
      html = GetAfter(html,"<base")
      html = GetBefore(html,">")
      html = GetAfter(html,"href=")
      html = Trim(html)
      If html
          baseurl=html
      EndIf
  EndIf
  
  ProcedureReturn EnsureEnd(baseurl,"/")
  
EndProcedure






Procedure.s FormBoundary()
  ProcedureReturn "----=_NextPart_000_001F_01C96422.C9363B50"
EndProcedure




Procedure.s FormHTMLPart(html.s,url.s,boundary.s)
  
  header.s = "--"+boundary +c13
  header + "Content-Type: text/html;" +c13
  header + "        charset="+c34+"iso-8859-1"+c34 +c13
  header + "Content-Transfer-Encoding: quoted-printable" +c13
  header + ContentLocation(url)
  
  header+c13+c13
  
  encoded.s = QuotedPrintable(html)
  
  ProcedureReturn header+encoded+c13+c13+c13
  
EndProcedure




Procedure.s FormHeaderData(title.s,boundary.s)
  
  header.s = "From: <Saved by ClufleyMHT>" +c13
  header + "Subject: "+title +c13
  header + "Date: "+MHTDateTime(Date()) +c13
  header + "MIME-Version: 1.0" +c13
  header + "Content-Type: multipart/related;" +c13
	header + "        type="+c34+"text/html"+c34+";" +c13
  header + "	boundary="+c34+boundary+c34 +c13
  header + "X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3350" +c13+c13
  
  header + "This is a multi-part message in MIME format." +c13+c13
  
  
  ProcedureReturn header
  
EndProcedure







;-
;- THE MHT PROCEDURE

Procedure.b URL2MHT(url.s,mht.s,mode.b)
  
  If Not url
      ProcedureReturn #False
  EndIf
  
  folder.s = mainfolder+RandomCode(8)+"\"
  EnsureFolderPath(folder)
  
  ;R("URL: "+url+Chr(13)+"MHT: "+mht)
  
  ww=800
  wh=800
  win = OpenWindow(#PB_Any, 0, 0, ww, wh, "MHT displayer", #PB_Window_Invisible)
  wg = WebGadget(#PB_Any, 5, 5, ww-10, wh-10, url)
  WaitForWebgadget2(wg,10000)
  
  
  baseurl.s = GetWebpageBaseDirectory(wg)
  title.s
  html.s
  If mode=#MHT_Current
      title = WBGetTitle(wg)
      html = GetWebgadgetHTML(wg)
  Else
      title = GetGadgetItemText(wg,#PB_Web_PageTitle)
      html = GetGadgetItemText(wg,#PB_Web_HtmlCode)
  EndIf
  CloseWindow(win)
  
  If Not mht
      mht = mainfolder+ReplaceString(title,c32+c32,c32)
  EndIf
  mht = EnsureEnd(mht,".mht")
  
  mht = ValidateFilename(mht,c32)
  
  boundary.s = FormBoundary()
  
  headerdata.s = FormHeaderData(title,boundary)
  
  filelist.s = GetLinkedResourceNames(html)
  
  files = CountString(filelist,"|")
  Dim filedata.s(files+1)
  network.b = InitNetwork()
  ;RStatus("network",network)
  
  
  resourcepointer.s
  resourcepath.s
  If files
      For f = 1 To files
          resourcepointer = StringField(filelist,f,"|")
          resourcepointer = RemoveString(resourcepointer,c34)
          
          If Left(resourcepointer,1)="/"
              resourcepath = EnsureNotEnd(RemoveFieldsFromEnd(baseurl,"/",1),"/") +resourcepointer
          Else
              If FileIsOnline(resourcepointer)
                  resourcepath = resourcepointer
              Else
                  resourcepath = baseurl+resourcepointer
              EndIf
          EndIf
          
          filedata(f) = FormPart(resourcepath,folder,boundary,network)
          
          ; now update html with "internal" reference to resource
          html = ReplaceString(html,resourcepointer,Alter(resourcepath))
      Next f
  EndIf
  
  ; encode html to Quoted Printable format
  qphtml.s = FormHTMLPart(html,url,boundary)
  
  total.s = headerdata+qphtml
  If files
      For f = 1 To files
          total+filedata(f)
      Next f
  EndIf
  
  DeleteDirectory(folder,"*.*",#PB_FileSystem_Recursive|#PB_FileSystem_Force)
  
  
  EnsureFolderPath(GetPathPart(mht))
  status = StringToFile(mht,total)
  
  ProcedureReturn status
  
EndProcedure




;- AT LAST, PROGRAM START!

url.s
destfile.s

;url = InputRequester("URL TO MAKE MHT FROM","Please enter URL below:","")

;url = "http://www.purebasic.fr/english/index.php" ; main forum index
url = "http://www.purebasic.fr/english/viewforum.php?f=13" ; coding forum
;url = "http://www.adobeforums.com/webx/.3bc459a3/"
;url = "http://timesonline.typepad.com/timesarchive/2009/01/10-freaky-finds.html"
;url = "http://en.wikipedia.org/wiki/Ralf_Hutter"


status = URL2MHT(url,destfile,#MHT_Current)

R("STATUS: "+Str(status))


End


;--------------------
; CLUFLEY NEEDS YOU!

If you want to help with this "project", please do.

You can help by testing out this code with different URLs and reporting errors. The URLs would be helpful.

If you can think of options for the finished function, please post them. ("Express mode" etc.)

Finally there are two things I'm having trouble with:
  • Obtaining the page's DocType using the IHTMLDocument5 interface. Specific thread.
  • Obtaining images created by a PHP/CGI call. Specific thread.
    This is a serious shortcoming because the purpose of an MHT is that it displays properly offline. All needed resources are stored inside the file. Normal, specified images can easily be obtained and encoded but if it's done via a server request, there'll be a big ugly hole where the image should be!
  • Legal forming of a unique "boundary" for each MHT file. See FormBoundary() procedure.
Thanks for reading,
Seymour.
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re:

Post by MachineCode »

Here's some old VB code to save a web page as an MHT file. Does anyone know how to convert it to PureBasic? It looks like COMate is the tool but I have no idea how to convert this to the format that COMate uses. Would be an interesting lesson to compare the results, if it can be done.

Source: http://sahilmalik.blogspot.com/2005/11/ ... pages.html

Code: Select all

Private Sub SavePage(ByVal Url As String, ByVal FilePath As String)
Dim iMessage As CDO.Message = New CDO.Message
iMessage.CreateMHTMLBody(Url, _
CDO.CdoMHTMLFlags.cdoSuppressNone, "", "")
Dim adodbstream As ADODB.Stream = New ADODB.Stream
adodbstream.Type = ADODB.StreamTypeEnum.adTypeText
adodbstream.Charset = "US-ASCII"
adodbstream.Open()
iMessage.DataSource.SaveToObject(stream, "_Stream")
adodbstream.SaveToFile(FilePath, _
ADODB.SaveOptionsEnum.adSaveCreateOverWrite)
End Sub
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Seymour Clufley
Addict
Addict
Posts: 1267
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Re: Programmatically saving a webpage as MHT

Post by Seymour Clufley »

The code I posted in this thread never really worked very well. MHT is obsolete anyway. You should check out this code instead, which produces self-contained HTML files viewable in any browser (except IE1-8).
JACK WEBB: "Coding in C is like sculpting a statue using only sandpaper. You can do it, but the result wouldn't be any better. So why bother? Just use the right tools and get the job done."
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Programmatically saving a webpage as MHT

Post by MachineCode »

MHT may be obsolete, but it still works. So, I'd still like to see a conversion for the VB code, if someone can help.

Also, you say your example doesn't work with IE 1-8... unfortunately we have IE 6 at work, so I can't use your code. And yours doesn't work with online web pages? That's what I need MHT for. I want to save an online page, not a local page.

Thanks anyway. I hope someone can help with the MHT conversion instead, even just to teach how to convert the VB example to COMate.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Seymour Clufley
Addict
Addict
Posts: 1267
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Re: Programmatically saving a webpage as MHT

Post by Seymour Clufley »

IE6... you poor guy!
JACK WEBB: "Coding in C is like sculpting a statue using only sandpaper. You can do it, but the result wouldn't be any better. So why bother? Just use the right tools and get the job done."
MachineCode
Addict
Addict
Posts: 1482
Joined: Tue Feb 22, 2011 1:16 pm

Re: Programmatically saving a webpage as MHT

Post by MachineCode »

Why? IE 6 does the job. Software doesn't stop working just because it's old. It's used for in-house web-based apps, not for surfing.
Microsoft Visual Basic only lasted 7 short years: 1991 to 1998.
PureBasic: Born in 1998 and still going strong to this very day!
Seymour Clufley
Addict
Addict
Posts: 1267
Joined: Wed Feb 28, 2007 9:13 am
Location: London

Re: Programmatically saving a webpage as MHT

Post by Seymour Clufley »

I'm sure IE6 is fine for in-house apps. You'll be tailoring those apps to work in IE6 (I'm guessing they wouldn't work in modern browsers). But as a browser for today's Internet, IE6 is a joke. According to caniuse.com, it's only 6% capable, compared to Chrome's 87%!

It only "does the job" if the job doesn't involve canvas, SVG, video, transparent images, standards-compliant CSS, or a host of other things modern browsers handle. And it doesn't support dataURL, which is why "Internalised" files won't work in it.

IE6 will have you jumping through hoops to get things working whereas in other browsers, you can use standardised modern technologies and get the job done much faster.

Honestly, ask your boss to install Chrome!
JACK WEBB: "Coding in C is like sculpting a statue using only sandpaper. You can do it, but the result wouldn't be any better. So why bother? Just use the right tools and get the job done."
Post Reply