Downloading to an HTML file then converting it to plain text

Perkin
Enthusiast
Posts: 504
Joined: Thu Jul 03, 2008 10:13 pm
Location: Kent, UK

Post by Perkin »

Hi, I've just adapted an example from the CodeArchive.

Does this do what you want?

Code: Select all

; English forum: http://www.purebasic.fr/english/viewtopic.php?p=35682
; Author: Freak (updated for PB 4.00 by mardanny71, description added by Andre)
; Adapted: Perkin (05/08/08)
; Date: 26. August 2003
; OS: Windows 
; Demo: No 

;---------------------------------------------------------------------------------------------------- 
; Note about the "copy" function:
; It copies ALL of the page's text to the clipboard; that text is then
; shown in the MessageRequester below.

Enumeration 1 
  #OLECMDID_OPEN          
  #OLECMDID_NEW        
  #OLECMDID_SAVE          
  #OLECMDID_SAVEAS            
  #OLECMDID_SAVECOPYAS    
  #OLECMDID_PRINT        
  #OLECMDID_PRINTPREVIEW        
  #OLECMDID_PAGESETUP        
  #OLECMDID_SPELL            
  #OLECMDID_PROPERTIES  
  #OLECMDID_CUT          
  #OLECMDID_COPY        
  #OLECMDID_PASTE            
  #OLECMDID_PASTESPECIAL    
  #OLECMDID_UNDO            
  #OLECMDID_REDO          
  #OLECMDID_SELECTALL        
  #OLECMDID_CLEARSELECTION 
  #OLECMDID_ZOOM            
  #OLECMDID_GETZOOMRANGE      
  #OLECMDID_UPDATECOMMANDS  
  #OLECMDID_REFRESH            
  #OLECMDID_STOP              
  #OLECMDID_HIDETOOLBARS      
  #OLECMDID_SETPROGRESSMAX    
  #OLECMDID_SETPROGRESSPOS  
  #OLECMDID_SETPROGRESSTEXT    
  #OLECMDID_SETTITLE          
  #OLECMDID_SETDOWNLOADSTATE  
  #OLECMDID_STOPDOWNLOAD      
EndEnumeration 

Enumeration 0 
  #OLECMDEXECOPT_DODEFAULT      
  #OLECMDEXECOPT_PROMPTUSER        
  #OLECMDEXECOPT_DONTPROMPTUSER    
  #OLECMDEXECOPT_SHOWHELP        
EndEnumeration 

; ------------------------------------------------------------------------------------------ 

; Now the code 

#WebGadget = 1 
#Button = 2 

OpenWindow(0, 0, 0, 800, 600,"WebBrowser" ,#PB_Window_ScreenCentered|#PB_Window_SystemMenu ) 

CreateGadgetList(WindowID(0)) 

  WebGadget(#WebGadget, 10, 40, 780, 550, "www.purebasic.com") 
  ButtonGadget(#Button, 10, 10, 60, 20, "Copy") 

; Fred the genius stored the Interface pointer to IWebBrowser2 in the DATA 
; member of the windowstructure of the WebGadget containerwindow, so we can get 
; that easily: 
  
WebObject.IWebBrowser2 = GetWindowLong_(GadgetID(#WebGadget), #GWL_USERDATA) 

Repeat 
  Event = WaitWindowEvent() 
  If Event = #PB_Event_Gadget And EventGadget() = #Button 
    
    ; Now here's the actual copy thing, not that complicated... 
    WebObject\ExecWB(#OLECMDID_SELECTALL, #OLECMDEXECOPT_DODEFAULT, 0, 0) 
    WebObject\ExecWB(#OLECMDID_COPY, #OLECMDEXECOPT_DODEFAULT, 0, 0) 
    WebObject\ExecWB(#OLECMDID_CLEARSELECTION, #OLECMDEXECOPT_DONTPROMPTUSER, 0, 0) 
    
    fullcontents.s=GetClipboardText()
    ;Debug fullcontents

    ; little test: 
    MessageRequester("", fullcontents, 0) 
    
  EndIf 
Until Event = #PB_Event_CloseWindow 

End 
%101010 = $2A = 42
PB
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

@Perkin: That's not reliable. Run it once, then exit, change the URL and
run it again: the previously copied text is still there instead of the new text.
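
One way to at least detect this case might be to empty the clipboard before copying, so that a failed copy returns an empty string rather than stale text. The following is only a sketch: CopyPageText() is a hypothetical helper, not part of the posted code, and the constant values are the ones the enumerations in the post above produce. It does not address the underlying focus problem.

```purebasic
; Values as produced by the enumerations in the post above
#OLECMDID_COPY                = 12
#OLECMDID_SELECTALL           = 17
#OLECMDID_CLEARSELECTION      = 18
#OLECMDEXECOPT_DODEFAULT      = 0
#OLECMDEXECOPT_DONTPROMPTUSER = 2

; Hypothetical helper: returns the page text, or "" if nothing was copied
Procedure.s CopyPageText(*Browser.IWebBrowser2)
  ClearClipboard() ; remove any text left over from an earlier run
  *Browser\ExecWB(#OLECMDID_SELECTALL, #OLECMDEXECOPT_DODEFAULT, 0, 0)
  *Browser\ExecWB(#OLECMDID_COPY, #OLECMDEXECOPT_DODEFAULT, 0, 0)
  *Browser\ExecWB(#OLECMDID_CLEARSELECTION, #OLECMDEXECOPT_DONTPROMPTUSER, 0, 0)
  ProcedureReturn GetClipboardText()
EndProcedure
```

In the button handler this would be called as fullcontents.s = CopyPageText(WebObject); an empty result then means the copy did not happen.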
Sparkie
PureBatMan Forever
Posts: 2307
Joined: Tue Feb 10, 2004 3:07 am
Location: Ohio, USA

Post by Sparkie »

@Perkin: If you want to use your method you'll have to send a mouse click to ensure the document has focus, otherwise you'll experience the problem PB has described.
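
A rough sketch of such a simulated click, using plain Windows API calls, could look like this. ClickGadgetCenter() is a hypothetical name, and clicking the middle of the gadget is an assumption made for simplicity: if a link happens to sit there, the click will follow it.

```purebasic
; Hypothetical helper: moves the cursor to the middle of a gadget
; and simulates a left mouse click there (Windows only).
Procedure ClickGadgetCenter(Gadget)
  Protected r.RECT
  If GetWindowRect_(GadgetID(Gadget), @r) ; gadget rectangle in screen coordinates
    SetCursorPos_((r\left + r\right) / 2, (r\top + r\bottom) / 2)
    mouse_event_(#MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0)
    mouse_event_(#MOUSEEVENTF_LEFTUP, 0, 0, 0, 0)
  EndIf
EndProcedure
```

It would be called as ClickGadgetCenter(#WebGadget) before the ExecWB() calls. The click arrives through the message queue, so the window probably has to process its pending events before the copy will see the focused document.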
What goes around comes around.

PB 5.21 LTS (x86) - Windows 8.1
AND51
Addict
Posts: 1040
Joined: Sun Oct 15, 2006 8:56 pm
Location: Germany

Post by AND51 »

Here's a simple solution with regular expressions.
Tasks like this are what regular expressions are made for!

This is just a simple demonstration; remember to call FreeRegularExpression() and CloseFile() in real code.

Code: Select all

EnableExplicit

InitNetwork()

If Not ReceiveHTTPFile("http://www.google.de", "C:\index.html") ; RECEIVING AND SAVING TO FILE
  Debug "HTML cannot be downloaded"
  End
EndIf

Define html$

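If ReadFile(0, "C:\index.html") ; READING AND REMOVING HTML
  While Not Eof(0)
    html$ + ReadString(0) + " " ; the space keeps words on adjacent lines apart
  Wend
  CloseFile(0) ; done reading
  ; \b[^>]* lets the opening tag carry attributes, e.g. <script type=...>
  CreateRegularExpression(0, "(?iU)<(script|style)\b[^>]*>.*</\1>") ; Remove JavaScripts and Stylesheets
  CreateRegularExpression(1, "(?iU)<.*>") ; Remove HTML-Tags
  html$=ReplaceRegularExpression(0, html$, "")
  html$=ReplaceRegularExpression(1, html$, "")
  FreeRegularExpression(0)
  FreeRegularExpression(1)
  Debug html$
EndIf

If CreateFile(0, "C:\index.html") ; WRITE BACK TO FILE
  WriteStringN(0, html$)
  CloseFile(0) ; close the file before handing it to another program
  RunProgram("C:\index.html")
EndIf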
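
A pattern written as <(script|style)> matches only bare tags, while real pages usually put attributes on them, which may be one reason a filter like this misses scripts. A small sketch comparing that form with a variant that tolerates attributes (the sample string and the regex numbers are made up for illustration):

```purebasic
EnableExplicit

Define sample$ = "<script type='text/javascript'>alert(1)</script>Hello"

; Strict pattern: the opening tag must be exactly <script> or <style>
If CreateRegularExpression(0, "(?iU)<(script|style)>.*</\1>")
  Debug ReplaceRegularExpression(0, sample$, "")
  FreeRegularExpression(0)
EndIf

; Tolerant pattern: \b[^>]* allows attributes after the tag name
If CreateRegularExpression(1, "(?iU)<(script|style)\b[^>]*>.*</\1>")
  Debug ReplaceRegularExpression(1, sample$, "")
  FreeRegularExpression(1)
EndIf
```

The first Debug line should show the sample unchanged, because the opening tag has an attribute; the second should show only the text after the script element.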
PB 4.30

Code: Select all

onErrorGoto(?Fred)
TerryHough
Enthusiast
Posts: 781
Joined: Fri Apr 25, 2003 6:51 pm
Location: NC, USA

Post by TerryHough »

Sparkie wrote: @Perkin: If you want to use your method you'll have to send a mouse click to ensure the document has focus, otherwise you'll experience the problem PB has described.
How do I ensure the document has focus?

My routine does everything I need, but I have to manually click on the document first. Then everything works as expected.

I would like to make that focus automatic.

Can someone give me a clue?
PB
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

The code I posted works great and doesn't need focussing at all. Don't know
why people seem to be ignoring it for other non-working solutions. (Even the
regular expression one doesn't work properly).
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
TerryHough
Enthusiast
Posts: 781
Joined: Fri Apr 25, 2003 6:51 pm
Location: NC, USA

Post by TerryHough »

PB wrote: Don't know why people seem to be ignoring it for other non-working solutions.
PB,
I assure you that I didn't ignore your code. I tried it, and it works for some web pages, but it didn't work at all with the one I needed to convert. It loads the web page, but the conversion doesn't find anything, probably because of how the page is generated.

The code from Perkin works perfectly, but only after giving the document focus.

So, I just need a way to make the focus automatic and I am stumped.
lionel_om
User
Posts: 31
Joined: Wed Jul 18, 2007 4:14 pm
Location: France

Post by lionel_om »

Hi all,

Here is a simple command-line program that extracts plain text or safe HTML (with scripts, heads and styles removed) from a file containing the source of a web page.

Feel free to use it as you want!

Usage:

Code: Select all

program.exe <mode> -in <source_file> [-out <out_file>]
  mode = "-overview" or "-plaintext"
  -out <out_file> not given: result is displayed in the console
Code:

Code: Select all

; //   CONSTANTS
; // -----------------

Enumeration 0
  #MODE_NOT_DEFINED
  #MODE_PLAINTEXT
  #MODE_OVERVIEW
EndEnumeration




#FILE_NOT_FOUND_TAG = "<!-$$$ File Not Found $$$-!>"

#PARAM_FILE_IN   = "-in"
#PARAM_FILE_OUT  = "-out"
#PARAM_OVERVIEW  = "-overview"
#PARAM_PLAINTEXT = "-plaintext"



; //   PROCEDURES
; // -----------------



Procedure.s GetFileContent(FileName$, EndLineChars$ = #CRLF$, FileNotFound$ = #FILE_NOT_FOUND_TAG, KeepEmptyLines.b = #True)
  Protected Retour$ = "", hFile.l, Line$
  hFile = ReadFile(#PB_Any, FileName$)
  If hFile
    While Eof(hFile) = #Null
      Line$ = ReadString(hFile)
      If Len(Line$) Or KeepEmptyLines
        Retour$ + Line$ + EndLineChars$
      EndIf
    Wend
    CloseFile(hFile)
  Else
    ProcedureReturn FileNotFound$
  EndIf
  ProcedureReturn Retour$
EndProcedure

Procedure.s GetNewLinePattern(*char.Byte)
  Protected *next_.Byte
  While *char\b <> 0
    *next_ = *char + 1
    If *char\b = 10
      If *next_\b = 13
        ProcedureReturn #LFCR$
      Else
        ProcedureReturn #LF$
      EndIf
    ElseIf *char\b = 13
      If *next_\b = 10
        ProcedureReturn #CRLF$
      Else
        ProcedureReturn #CR$
      EndIf
    EndIf
    *char = *next_
  Wend
  ProcedureReturn Chr(13)
EndProcedure

Procedure.s Ereg_Replace(Text$, Pattern$, Replace$ = "", Options.l = #PB_RegularExpression_DotAll |  #PB_RegularExpression_Extended |  #PB_RegularExpression_AnyNewLine)
  hRegex = CreateRegularExpression(#PB_Any, Pattern$, Options)
  If hRegex
    Text$ = ReplaceRegularExpression(hRegex, Text$, Replace$)
    FreeRegularExpression(hRegex)
  Else
    Debug "Can't create a Regex with this pattern : " + Pattern$
  EndIf
  ProcedureReturn Text$
EndProcedure

Procedure.s GetPlainText(Text$)
  Text$ = Ereg_Replace(Text$, "\<head.+\/head\>")
  ;Text$ = Ereg_Replace(Text$, "\<style.+\/style\>")
  ;Text$ = Ereg_Replace(Text$, "\<script.+\/script\>")
  ; \b[^>]* lets the opening tag carry attributes, e.g. <script type=...>
  Text$ = Ereg_Replace(Text$, "(?iU)<(script|style)\b[^>]*>.*</\1>")
  Text$ = Ereg_Replace(Text$, "(?iU)<.*>", " ") ;"<[^>]+>", " ")
  Text$ = Ereg_Replace(Text$, "[ \t\n\r]+", " ") ; collapse runs of whitespace
  ProcedureReturn Text$
EndProcedure

Procedure.s GetOverview(Text$)
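  ; .+? is lazy, so each match stops at the first closing tag instead of the last one
  Text$ = Ereg_Replace(Text$, "<head.+?\/head>")
  Text$ = Ereg_Replace(Text$, "<style.+?\/style>")
  Text$ = Ereg_Replace(Text$, "<script.+?\/script>")
  Text$ = Ereg_Replace(Text$, "<!DOCTYPE[^>]*>")
  ;Text$ = RemoveTagProperties(Text$)
  ; Strip attributes so that <a href=...> becomes <a>. This relies on \1 being expanded
  ; in the replacement string; the manual says back references are not supported by
  ; ReplaceRegularExpression(), so verify this on your PureBasic version.
  Text$ = Ereg_Replace(Text$, "<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", "<\1>")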
  ProcedureReturn Text$
EndProcedure

Procedure SendResult(Text$, Out$)
  If Out$
    hFile = CreateFile(#PB_Any, Out$)
    If hFile
      WriteString(hFile, Text$)
      CloseFile(hFile)
    EndIf
  Else
    OpenConsole()
    PrintN(Text$)
    Input()
  EndIf
EndProcedure




; //   INIT
; // -----------------

Mode.l   = #MODE_NOT_DEFINED
FileIN$  = #NULL$
FileOUT$ = #NULL$




; //   MAIN
; // -----------------

nb = CountProgramParameters()-1
For i = 0 To nb
  Select ProgramParameter(i)
    
    Case #PARAM_PLAINTEXT
      Mode = #MODE_PLAINTEXT
    
    Case #PARAM_OVERVIEW
      Mode = #MODE_OVERVIEW
    
    Case #PARAM_FILE_IN
      If i < nb
        FileIN$ = ProgramParameter(i+1)
        i + 1
      EndIf
  
    Case #PARAM_FILE_OUT
      If i < nb
        FileOUT$ = ProgramParameter(i+1)
        i + 1
      EndIf
  
  EndSelect
Next

If Mode <> #MODE_NOT_DEFINED And FileIN$ And FileSize(FileIN$) > 0
  Text$ = GetFileContent(FileIN$, #CR$)
  If Text$ <> #FILE_NOT_FOUND_TAG
    
    If Mode = #MODE_OVERVIEW
      Text$ = GetOverview(Text$)
      SendResult(Text$, FileOUT$)
    ElseIf Mode = #MODE_PLAINTEXT
      Text$ = GetPlainText(Text$)
      SendResult(Text$, FileOUT$)
    EndIf
    
  Else
    Debug "File not found!"
  EndIf
Else
  Debug "Parameters are not correct!"
EndIf
/Lio :)
Webmaster of Basic-univers