Posted: Tue Aug 05, 2008 12:44 pm
by Perkin
Hi, I've just adapted an example from the code archive.

Does this do what you want?

Code:

; English forum:http://www.purebasic.fr/english/viewtopic.php?p=35682 
; Author: Freak (updated for PB 4.00 by mardanny71, description added by Andre)
;Adapted: Perkin (05/08/08)
; Date: 26. August 2003 
; OS: Windows 
; Demo: No 

;---------------------------------------------------------------------------------------------------- 
; Note about using the "copy" function:
; This will copy ALL text into the clipboard and so will display its contents 
; in the messagerequester later.

Enumeration 1 
  #OLECMDID_OPEN          
  #OLECMDID_NEW        
  #OLECMDID_SAVE          
  #OLECMDID_SAVEAS            
  #OLECMDID_SAVECOPYAS    
  #OLECMDID_PRINT        
  #OLECMDID_PRINTPREVIEW        
  #OLECMDID_PAGESETUP        
  #OLECMDID_SPELL            
  #OLECMDID_PROPERTIES  
  #OLECMDID_CUT          
  #OLECMDID_COPY        
  #OLECMDID_PASTE            
  #OLECMDID_PASTESPECIAL    
  #OLECMDID_UNDO            
  #OLECMDID_REDO          
  #OLECMDID_SELECTALL        
  #OLECMDID_CLEARSELECTION 
  #OLECMDID_ZOOM            
  #OLECMDID_GETZOOMRANGE      
  #OLECMDID_UPDATECOMMANDS  
  #OLECMDID_REFRESH            
  #OLECMDID_STOP              
  #OLECMDID_HIDETOOLBARS      
  #OLECMDID_SETPROGRESSMAX    
  #OLECMDID_SETPROGRESSPOS  
  #OLECMDID_SETPROGRESSTEXT    
  #OLECMDID_SETTITLE          
  #OLECMDID_SETDOWNLOADSTATE  
  #OLECMDID_STOPDOWNLOAD      
EndEnumeration 

Enumeration 0 
  #OLECMDEXECOPT_DODEFAULT      
  #OLECMDEXECOPT_PROMPTUSER        
  #OLECMDEXECOPT_DONTPROMPTUSER    
  #OLECMDEXECOPT_SHOWHELP        
EndEnumeration 

; ------------------------------------------------------------------------------------------ 

; Now the code 

#WebGadget = 1 
#Button = 2 

OpenWindow(0, 0, 0, 800, 600,"WebBrowser" ,#PB_Window_ScreenCentered|#PB_Window_SystemMenu ) 

CreateGadgetList(WindowID(0)) 

  WebGadget(#WebGadget, 10, 40, 780, 550, "www.purebasic.com") 
  ButtonGadget(#Button, 10, 10, 60, 20, "Copy") 

; Fred the genius stored the Interface pointer to IWebBrowser2 in the DATA 
; member of the windowstructure of the WebGadget containerwindow, so we can get 
; that easily: 
  
WebObject.IWebBrowser2 = GetWindowLong_(GadgetID(#WebGadget), #GWL_USERDATA) 

Repeat 
  Event = WaitWindowEvent() 
  If Event = #PB_Event_Gadget And EventGadget() = #Button 
    
    ; Now here's the actual copy thing, not that complicated... 
    WebObject\ExecWB(#OLECMDID_SELECTALL, #OLECMDEXECOPT_DODEFAULT, 0, 0) 
    WebObject\ExecWB(#OLECMDID_COPY, #OLECMDEXECOPT_DODEFAULT, 0, 0) 
    WebObject\ExecWB(#OLECMDID_CLEARSELECTION, #OLECMDEXECOPT_DONTPROMPTUSER, 0, 0) 
    
    fullcontents.s=GetClipboardText()
    ;Debug fullcontents

    ; little test: 
    MessageRequester("", fullcontents, 0) 
    
  EndIf 
Until Event = #PB_Event_CloseWindow 

End 

Posted: Tue Aug 05, 2008 2:43 pm
by PB
@Perkin: That's not reliable. Run it once, then exit and change the URL and
run it again, and the previous copied text is still there instead of the new.
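
One way to make a stale result visible, sketched here under assumptions: empty the clipboard before issuing the copy commands, so text left over from an earlier run cannot be mistaken for the new page. `CopyPageText()` is a hypothetical helper name, and the command IDs are simply the values that the enumerations in Perkin's listing produce (COPY = 12, SELECTALL = 17, CLEARSELECTION = 18). This does not cure the underlying problem; it only turns a wrong result into an empty one.

```purebasic
; Values produced by the enumerations in Perkin's listing
#OLECMDID_COPY           = 12
#OLECMDID_SELECTALL      = 17
#OLECMDID_CLEARSELECTION = 18
#OLECMDEXECOPT_DODEFAULT = 0

; Hypothetical helper: returns the copied page text, or "" if nothing was copied
Procedure.s CopyPageText(*Browser.IWebBrowser2)
  ClearClipboard() ; drop any text left over from a previous run
  *Browser\ExecWB(#OLECMDID_SELECTALL, #OLECMDEXECOPT_DODEFAULT, 0, 0)
  *Browser\ExecWB(#OLECMDID_COPY, #OLECMDEXECOPT_DODEFAULT, 0, 0)
  *Browser\ExecWB(#OLECMDID_CLEARSELECTION, #OLECMDEXECOPT_DODEFAULT, 0, 0)
  ProcedureReturn GetClipboardText()
EndProcedure
```

In the event loop this could be called as `fullcontents.s = CopyPageText(WebObject)`; an empty string then signals that the copy did not happen instead of silently showing the old text.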

Posted: Tue Aug 05, 2008 7:33 pm
by Sparkie
@Perkin: If you want to use your method you'll have to send a mouse click to ensure the document has focus, otherwise you'll experience the problem PB has described.
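
A rough sketch of giving the document focus from code. Note that this swaps in a different technique from the simulated mouse click suggested above: it looks up the embedded browser window and calls `SetFocus_()` on it. The window class names assume the Internet Explorer based WebGadget on Windows, `FocusWebDocument()` is a hypothetical name, and whether focus alone is enough for the copy commands has not been verified here.

```purebasic
; Hypothetical helper: tries to give keyboard focus to the HTML document
; hosted inside a WebGadget. Returns the document window handle, or 0.
Procedure FocusWebDocument(Gadget)
  Protected hWnd = GadgetID(Gadget)
  ; Assumed child window chain of the IE WebBrowser control
  hWnd = FindWindowEx_(hWnd, 0, "Shell Embedding", 0)
  If hWnd
    hWnd = FindWindowEx_(hWnd, 0, "Shell DocObject View", 0)
  EndIf
  If hWnd
    hWnd = FindWindowEx_(hWnd, 0, "Internet Explorer_Server", 0)
  EndIf
  If hWnd
    SetFocus_(hWnd) ; focus the document window itself, not the gadget frame
  EndIf
  ProcedureReturn hWnd
EndProcedure
```

It would be called as `FocusWebDocument(#WebGadget)` just before the `ExecWB` calls, and only once the page has finished loading, since the inner window may not exist earlier.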

Posted: Wed Aug 06, 2008 12:27 pm
by AND51
Here's a simple solution with regular expressions.
Tasks like this are what regular expressions are specialised for!

This is just a simple demonstration, don't forget to include FreeRegularExpression() and CloseFile().

Code:

EnableExplicit

InitNetwork()

If Not ReceiveHTTPFile("http://www.google.de", "C:\index.html") ; RECEIVING AND SAVING TO FILE
  Debug "HTML cannot be downloaded"
  End
EndIf

Define html$

If ReadFile(0, "C:\index.html") ; READING AND REMOVING HTML
  While Not Eof(0)
    html$+ReadString(0)
  Wend
  CreateRegularExpression(0, "(?iU)<(script|style)>.*</\1>") ; Remove JavaScripts and Stylesheets
  CreateRegularExpression(1, "(?iU)<.*>") ; Remove HTML-Tags
  html$=ReplaceRegularExpression(0, html$, "")
  html$=ReplaceRegularExpression(1, html$, "")
  Debug html$
EndIf

If CreateFile(0, "C:\index.html") ; WRITE BACK TO FILE
  WriteStringN(0, html$)
  CloseFile(0) ; flush and close the file before opening it
  RunProgram("C:\index.html")
EndIf

Posted: Thu Nov 06, 2008 9:14 pm
by TerryHough
Sparkie wrote: @Perkin: If you want to use your method you'll have to send a mouse click to ensure the document has focus, otherwise you'll experience the problem PB has described.
How do I ensure the document has focus?

My routine does everything I need, but I have to manually click on the document first. Then everything works as expected.

I would like to make that focus automatic.

Can someone give me a clue?

Posted: Thu Nov 06, 2008 10:50 pm
by PB
The code I posted works great and doesn't need focussing at all. Don't know
why people seem to be ignoring it for other non-working solutions. (Even the
regular expression one doesn't work properly).
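
The thread does not say why the regular-expression version fails. One possible cause, offered only as a guess, is that its first pattern matches only bare `<script>` and `<style>` tags and so skips blocks whose opening tag carries attributes. A slightly more tolerant sketch, with `StripTags()` as a hypothetical name:

```purebasic
; Hypothetical helper: removes script/style blocks, then all remaining tags
Procedure.s StripTags(html$)
  Protected re
  ; (?is) = case-insensitive, dot matches newlines; [^>]* allows attributes
  re = CreateRegularExpression(#PB_Any, "(?is)<(script|style)\b[^>]*>.*?</\1\s*>")
  If re
    html$ = ReplaceRegularExpression(re, html$, "")
    FreeRegularExpression(re)
  EndIf
  ; Replace every remaining tag with a space so adjacent words stay separated
  re = CreateRegularExpression(#PB_Any, "<[^>]*>")
  If re
    html$ = ReplaceRegularExpression(re, html$, " ")
    FreeRegularExpression(re)
  EndIf
  ProcedureReturn html$
EndProcedure

Debug StripTags("<p>Hello</p><script type='text/javascript'>var a = 1;</script><b>World</b>")
```

Regular expressions remain a blunt tool for HTML: comments, entities such as `&amp;` and malformed markup are not handled by this sketch.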

Posted: Fri Nov 07, 2008 3:35 pm
by TerryHough
PB wrote: Don't know
why people seem to be ignoring it for other non-working solutions.
PB,
I assure you that I didn't ignore your code. I tried it, and it works for some web pages, but it didn't work at all with the one I needed to convert. It loads the web page, but the conversion finds nothing, probably because of how the page is generated.

The code from Perkin works perfectly, but only after giving the document focus.

So, I just need a way to make the focus automatic and I am stumped.

Posted: Fri Nov 07, 2008 5:06 pm
by lionel_om
Hi all,

I've made a simple command-line program that extracts plain text or safe HTML (by removing scripts, heads and styles) from a file containing the source of a web page.

Feel free to use it as you want!

Usage:

Code:

program.exe <mode> -in <source_file> [-out <out_file>]
  <mode> = "-overview" or "-plaintext"
  -out not given: the result is displayed in the console

Code:

; //   CONSTANTS
; // -----------------

Enumeration 0
  #MODE_NOT_DEFINED
  #MODE_PLAINTEXT
  #MODE_OVERVIEW
EndEnumeration




#FILE_NOT_FOUND_TAG = "<!-$$$ File Not Found $$$-!>"

#PARAM_FILE_IN   = "-in"
#PARAM_FILE_OUT  = "-out"
#PARAM_OVERVIEW  = "-overview"
#PARAM_PLAINTEXT = "-plaintext"



; //   PROCEDURES
; // -----------------



Procedure.s GetFileContent(FileName$, EndLineChars$ = #CRLF$, FileNotFound$ = #FILE_NOT_FOUND_TAG, KeepEmptyLines.b = #True)
  Protected Retour$ = "", hFile.l, Line$
  hFile = ReadFile(#PB_Any, FileName$)
  If hFile
    While Eof(hFile) = #Null
      Line$ = ReadString(hFile)
      If Len(Line$) Or KeepEmptyLines
        Retour$ + Line$ + EndLineChars$
      EndIf
    Wend
    CloseFile(hFile)
  Else
    ProcedureReturn FileNotFound$
  EndIf
  ProcedureReturn Retour$
EndProcedure

Procedure.s GetNewLinePattern(*char.Byte)
  Protected *next_.Byte
  While *char\b <> 0
    *next_ = *char + 1
    If *char\b = 10
      If *next_\b = 13
        ProcedureReturn #LFCR$
      Else
        ProcedureReturn #LF$
      EndIf
    ElseIf *char\b = 13
      If *next_\b = 10
        ProcedureReturn #CRLF$
      Else
        ProcedureReturn #CR$
      EndIf
    EndIf
    *char = *next_
  Wend
  ProcedureReturn Chr(13)
EndProcedure

Procedure.s Ereg_Replace(Text$, Pattern$, Replace$ = "", Options.l = #PB_RegularExpression_DotAll |  #PB_RegularExpression_Extended |  #PB_RegularExpression_AnyNewLine)
  hRegex = CreateRegularExpression(#PB_Any, Pattern$, Options)
  If hRegex
    Text$ = ReplaceRegularExpression(hRegex, Text$, Replace$)
    FreeRegularExpression(hRegex)
  Else
    Debug "Can't create a Regex with this pattern : " + Pattern$
  EndIf
  ProcedureReturn Text$
EndProcedure

Procedure.s GetPlainText(Text$)
  Text$ = Ereg_Replace(Text$, "\<head.+\/head\>")
  ;Text$ = Ereg_Replace(Text$, "\<style.+\/style\>")
  ;Text$ = Ereg_Replace(Text$, "\<script.+\/script\>")
  Text$ = Ereg_Replace(Text$, "(?iU)<(script|style)>.*</\1>")
  Text$ = Ereg_Replace(Text$, "(?iU)<.*>", " ") ;"<[^>]+>", " ")
  Text$ = Ereg_Replace(Text$, "[ \t\n\r]+", " ")
  ProcedureReturn Text$
EndProcedure

Procedure.s GetOverview(Text$)
  Text$ = Ereg_Replace(Text$, "<head.+\/head>")
  Text$ = Ereg_Replace(Text$, "<style.+\/style>")
  Text$ = Ereg_Replace(Text$, "<script.+\/script>")
  Text$ = Ereg_Replace(Text$, "<!DOCTYPE[^>]*>")
  ;Text$ = RemoveTagProperties(Text$)
  Text$ = Ereg_Replace(Text$, "<([a-zA-Z]+)\ *[^>]+>", "</\1>")
  ProcedureReturn Text$
EndProcedure

Procedure SendResult(Text$, Out$)
  If Out$
    hFile = CreateFile(#PB_Any, Out$)
    If hFile
      WriteString(hFile, Text$)
      CloseFile(hFile)
    EndIf
  Else
    OpenConsole()
    PrintN(Text$)
    Input()
  EndIf
EndProcedure




; //   INIT
; // -----------------

Mode.l   = #MODE_NOT_DEFINED
FileIN$  = #NULL$
FileOUT$ = #NULL$




; //   MAIN
; // -----------------

nb = CountProgramParameters()-1
For i = 0 To nb
  Select ProgramParameter(i)
    
    Case #PARAM_PLAINTEXT
      Mode = #MODE_PLAINTEXT
    
    Case #PARAM_OVERVIEW
      Mode = #MODE_OVERVIEW
    
    Case #PARAM_FILE_IN
      If i < nb
        FileIN$ = ProgramParameter(i+1)
        i + 1
      EndIf
  
    Case #PARAM_FILE_OUT
      If i < nb
        FileOUT$ = ProgramParameter(i+1)
        i + 1
      EndIf
  
  EndSelect
Next

If Mode <> #MODE_NOT_DEFINED And FileIN$ And FileSize(FileIN$) > 0
  Text$ = GetFileContent(FileIN$, #CR$)
  If Text$ <> #FILE_NOT_FOUND_TAG
    
    If Mode = #MODE_OVERVIEW
      Text$ = GetOverview(Text$)
      SendResult(Text$, FileOUT$)
    ElseIf Mode = #MODE_PLAINTEXT
      Text$ = GetPlainText(Text$)
      SendResult(Text$, FileOUT$)
    EndIf
    
  Else
    Debug "File not found !"
  EndIf
Else
  Debug "Parameters are not correct!"
EndIf
/Lio :)