Remove all HTML tags from a string
A search for "remove html" did not bring up anything useful, so I am asking here:
I have a string and want to remove everything that is HTML from it, be it a simple <p> or a long <a href>...link...</a>
Before I sit down and reinvent the wheel, let me ask: has anybody already programmed that?
- NicTheQuick
Re: Remove all HTML tags from a string
Are we talking about plain HTML or XHTML? For XHTML you can use PureBasic's XML library to parse it first and then extract only the text you need.
There are also some regular expressions out there that do the right thing in most cases. You can then use PureBasic's RegularExpression library to remove all matched tags.
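As a rough sketch of the regular expression route, assuming a deliberately naive pattern that treats everything from "<" to the next ">" as a tag (StripTagsRegex() is a made-up helper name, not something from the library):

```purebasic
; Sketch: strip anything that looks like a tag with the RegularExpression library.
; StripTagsRegex() is a hypothetical helper; the pattern is intentionally simple.
Procedure.s StripTagsRegex(html$)
  Protected result$ = html$
  Protected regex = CreateRegularExpression(#PB_Any, "<[^>]*>")
  If regex
    ; Replace every match with an empty string
    result$ = ReplaceRegularExpression(regex, html$, "")
    FreeRegularExpression(regex)
  EndIf
  ProcedureReturn result$
EndProcedure

Debug StripTagsRegex("<p>Hello <b>world</b></p>") ; Hello world
```

This will trip over a ">" inside an attribute value, over comments, and over the contents of script or style elements, so it is only suitable for simple strings.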
Re: Remove all HTML tags from a string
StarWarsFan wrote: ↑Wed May 18, 2022 11:41 am
A search for "remove html" did not bring up anything useful, so I ask this here:
I got a string and want to remove everything that is HTML from it, be it a simple <p> or a long <a href>...link...</a>
Before I sit down and reinvent the wheel let me ask: Has anybody before me possibly already programmed that?
Most of the existing tools try (with moderate success) to filter a whole page and therefore go to a lot of trouble to filter out scripts and other single tags.
If you only have one string, it is usually quite easy. Post a small example if needed; we can build you a custom regex if you don't know how.
Re: Remove all HTML tags from a string
You still need to process the entity codes
Code: Select all
&quot;
&amp;
&lt;
&gt;
&iexcl;
&cent;
&pound;
&copy;
etc., and the numeric codes
Code: Select all
&#(\d+);
It is necessary to connect to the browser engine object and request the page text from it. In AutoIt3 this is done through a function (_IEBodyReadText) or via the "innertext" property (_IEPropertyGet).
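For the entity and numeric codes listed in this post, a minimal sketch could look like the following. DecodeEntities() is a hypothetical helper, the list of named entities is far from complete, and only decimal codes that fit into a single PureBasic character are handled:

```purebasic
; Sketch: decode decimal numeric entities and a handful of named ones.
; DecodeEntities() is a made-up name; extend the named list as needed.
Procedure.s DecodeEntities(text$)
  Protected result$, lastEnd = 1, pos
  Protected regex = CreateRegularExpression(#PB_Any, "&#(\d+);")
  If regex = 0
    ProcedureReturn text$
  EndIf
  ; Numeric codes: copy the text between matches, then append the decoded character
  If ExamineRegularExpression(regex, text$)
    While NextRegularExpressionMatch(regex)
      pos = RegularExpressionMatchPosition(regex)
      result$ + Mid(text$, lastEnd, pos - lastEnd)
      result$ + Chr(Val(RegularExpressionGroup(regex, 1)))
      lastEnd = pos + RegularExpressionMatchLength(regex)
    Wend
  EndIf
  result$ + Mid(text$, lastEnd) ; remainder after the last match
  FreeRegularExpression(regex)
  ; Named entities; &amp; goes last so that "&amp;lt;" ends up as the text "&lt;"
  result$ = ReplaceString(result$, "&quot;", #DQUOTE$)
  result$ = ReplaceString(result$, "&lt;", "<")
  result$ = ReplaceString(result$, "&gt;", ">")
  result$ = ReplaceString(result$, "&nbsp;", " ")
  result$ = ReplaceString(result$, "&amp;", "&")
  ProcedureReturn result$
EndProcedure

Debug DecodeEntities("5 &lt; 6 &amp; &#169; 2022") ; 5 < 6 & © 2022
```

Call it only after the tags have been removed, otherwise a decoded "<" or ">" looks like a tag delimiter. Because the numeric codes are decoded in a separate pass before the named ones, an input such as "&#38;lt;" gets decoded twice; a strict decoder would do everything in a single pass.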
Re: Remove all HTML tags from a string
Okay, I shall do that myself then; anybody can join in if they so desire.
Ideas:
Tactically, I would put the HTML source code into a$.
Entities like "&nbsp;" are the easy case:
there a simple a$ = RemoveString(a$, "&nbsp;") should do.
Or you treat "&" as the start of the entity and ";" as its end.
I then search for the next tag, which in HTML must of course start with "<" and end with ">",
using For i = 1 To Len(a$) so that the entire string is processed.
If I find such a tag start ("<"), I then search on until I find the tag end (">") and construct two strings.
Let me give an example for a simple <b>test-text</b>:
If I have "<b>", I can simply construct what the end tag must look like; that is easily done with r2$ = ReplaceString(r1$, "<b", "</b")
r1$ would be "<b>"
r2$ would be "</b>"
And then I can do
a$ = RemoveString(a$, r1$)
a$ = RemoveString(a$, r2$)
and continue the loop.
BUT: let us assume longer tags like <a href.....>
There the end tag does not repeat the entire start tag; it is simply </a>.
I must somehow distinguish case 1 from case 2 (the one with attributes included).
Maybe look for an existing space character...
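One way to tell the two cases apart, as a rough sketch: take everything between "<" and ">", cut it at the first space to get the tag name, and build the end tag from that name alone. EndTagFor() is a made-up helper name and the sample string is only for illustration:

```purebasic
; Sketch: derive the end tag from a start tag, with or without attributes.
; EndTagFor() is a hypothetical helper name.
Procedure.s EndTagFor(startTag$)
  ; "<a href=...>" -> "a href=..." -> "a"   and   "<b>" -> "b"
  Protected inner$ = Mid(startTag$, 2, Len(startTag$) - 2)
  Protected name$ = StringField(Trim(inner$), 1, " ")
  ProcedureReturn "</" + name$ + ">"
EndProcedure

Define a$, r1$, r2$, startPos, endPos
a$ = "<a href=" + #DQUOTE$ + "x.html" + #DQUOTE$ + ">link</a>"
startPos = FindString(a$, "<")
endPos = FindString(a$, ">", startPos)
If startPos And endPos
  r1$ = Mid(a$, startPos, endPos - startPos + 1) ; the complete start tag
  r2$ = EndTagFor(r1$)                           ; "</a>"
  a$ = RemoveString(a$, r1$)
  a$ = RemoveString(a$, r2$)
EndIf
Debug a$ ; link
```

This handles only the first tag it finds; in the loop it would have to be repeated until no "<" is left, and tags without an end tag, such as <br>, simply have nothing to remove in the second step.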
Re: Remove all HTML tags from a string
You don't need to get anywhere near that complicated just to remove tags. You just need to traverse the string and set a flag when you enter or leave a tag. If you're in a tag, you omit the character from the output. If you aren't, you include it.
You will want to be more selective about entity codes though, as you may change the semantic content of the string if you just remove them. For example, removing &nbsp;, &ensp; or &emsp; will run the two adjacent words together. I'd use ReplaceString first to convert them to a space. Additionally, you need to replace the &lt; and &gt; entities only after you've removed all the tags, otherwise the decoded characters will be mistaken for tag delimiters.
Code: Select all
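Define.S a, b, c
Define.I l, i, tag = #False
; Test string with tags, an attribute and two kinds of entities
a = "<p>This is a paragraph.&ensp;It contains some <i>italic</i> text and some <b>bold</b> text.&ensp;" +
    "It also has a link to the PureBasic <a href=" + #DQUOTE$ + "www.purebasic.com" + #DQUOTE$ + ">website.</a>&ensp;" +
    "PureBasic is &gt; Basic.</p>"
; Convert spacing entities to a plain space before stripping the tags
a = ReplaceString(a, "&ensp;", " ")
l = Len(a)
For i = 1 To l
  c = Mid(a, i, 1)
  If c = "<"
    tag = #True  ; entering a tag
  ElseIf c = ">"
    tag = #False ; leaving a tag
  ElseIf Not(tag)
    b + c        ; outside a tag: keep the character
  EndIf
Next i
; Decode &gt; only after the tags are gone, so it is not mistaken for a tag end
b = ReplaceString(b, "&gt;", ">")
Debug a
Debug b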
Re: Remove all HTML tags from a string
Maybe a bit faster (I have not benchmarked it)
Code: Select all
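Define.S a, b
Define *HtmlPtr.Character, *Pos1
a = "<p>This is a paragraph.&ensp;It contains some <i>italic</i> text and some <b>bold</b> text.&ensp;" +
    "It also has a link to the PureBasic <a href=" + #DQUOTE$ + "www.purebasic.com" + #DQUOTE$ + ">website.</a>&ensp;" +
    "PureBasic is &gt; Basic.</p>"
a = ReplaceString(a, "&ensp;", " ")
*HtmlPtr = @a
*Pos1 = @a ; start of the current run of plain text, 0 while inside a tag
While *HtmlPtr\c
  If *HtmlPtr\c = '<'
    ; A tag starts: copy the text collected since the last tag end
    If *Pos1
      b + PeekS(*Pos1, (*HtmlPtr - *Pos1) / SizeOf(Character))
      *Pos1 = 0
    EndIf
  ElseIf *HtmlPtr\c = '>'
    ; A tag ends: plain text resumes at the next character
    *Pos1 = *HtmlPtr + SizeOf(Character)
  EndIf
  *HtmlPtr + SizeOf(Character)
Wend
; Copy any text that follows the last tag (the original loop dropped it)
If *Pos1
  b + PeekS(*Pos1)
EndIf
b = ReplaceString(b, "&gt;", ">")
Debug a
Debug b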
Re: Remove all HTML tags from a string
The code below uses built-in Windows functionality (the Internet Explorer engine behind the WebGadget), so it is Windows only. There is a very simple demo at the end of the code.
Code: Select all
; Extended WebBrowser Library functions by firace - partly adapted from code by freak
Define.s regkeyName, dwLabel, statusMsg, keyResult.i
Define.l dwValue, dwValueCheck
Define.l lastpressTimestamp = ElapsedMilliseconds()
Declare Async_OnPageChange()
Procedure.s RegReadString(HKMain, HKSub$, HKEntry$)
hKey = 0
If RegOpenKeyEx_(HKMain, HKSub$, 0, #KEY_QUERY_VALUE, @hKey) = #ERROR_SUCCESS
result$ = Space(4096)
bufLen = Len(result$)
If hKey
If RegQueryValueEx_(hKey, HKEntry$, 0, 0, @result$, @bufLen) <> #ERROR_SUCCESS
result$ = "Error reading Registry"
EndIf
RegCloseKey_(hKey)
EndIf
Else
result$ = "Error opening Registry key"
EndIf
ProcedureReturn result$
EndProcedure
ServiceVersionNumber$ = RegReadString(#HKEY_LOCAL_MACHINE,"SOFTWARE\Microsoft\Internet Explorer","svcUpdateVersion")
patchlevel = Val(Right(ServiceVersionNumber$,3))
regkeyName = "Software\Microsoft\Internet Explorer\Main\FeatureControl\Feature_Browser_Emulation\"
regkeyName3.s = "Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_RESTRICT_ABOUT_PROTOCOL_IE7\"
dwLabel = GetFilePart(ProgramFilename())
dwValue = 11001
RegOpenKeyEx_(#HKEY_CURRENT_USER, regkeyName, 0, #KEY_ALL_ACCESS, @keyResult)
RegSetValueEx_(keyResult, @dwLabel, 0, #REG_DWORD, @dwValue, SizeOf(Long))
UserAgent$ = "Mozilla/5.0 (Windows NT 6.1; WOW64) like Gecko"
UrlMkSetSessionOption_( $10000001 , Ascii(UserAgent$) , Len ( UserAgent$ ) , 0 )
#OLECMDID_PROPERTIES = 10
#olecmdid_find = 32
DataSection
IID_IHTMLElement: ; {3050F1FF-98B5-11CF-BB82-00AA00BDCE0B}
Data.l $3050F1FF
Data.w $98B5, $11CF
Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B
IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119}
Data.l $332C4425
Data.w $26CB, $11D0
Data.b $B4, $83, $00, $C0, $4F, $D9, $01, $19
EndDataSection
CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
Import ""
MakeBSTR(str.p-unicode) As "_SysAllocString"
EndImport
CompilerElse
Import ""
MakeBSTR(str.p-unicode) As "SysAllocString"
EndImport
CompilerEndIf
;; -- begin iDispatch interface functions. partly adapted from code by freak
#DISPID_NAVIGATEERROR= 271
Structure DispatchFunctions
QueryInterface.l
AddRef.l
Release.l
GetTypeInfoCount.l
GetTypeInfo.l
GetIDsOfNames.l
Invoke.l
EndStructure
Structure DispatchObject
*IDispatch.IDispatch
ObjectCount.l
EndStructure
Procedure.l AddRef(*THIS.DispatchObject)
*THIS\ObjectCount + 1
ProcedureReturn *THIS\ObjectCount
EndProcedure
Procedure.l QueryInterface(*THIS.DispatchObject, *iid.GUID, *Object.LONG)
If CompareMemory(*iid,?IID_DWebBrowserEvents2,16)
; CallDebugger
EndIf
If CompareMemory(*iid, ?IID_IUnknown, SizeOf(GUID)) Or CompareMemory(*iid, ?IID_IDispatch, SizeOf(GUID))
*Object\l = *THIS
AddRef(*THIS.DispatchObject)
ProcedureReturn #S_OK
Else
*Object\l = 0
ProcedureReturn #E_NOINTERFACE
EndIf
EndProcedure
Procedure.l Release(*THIS.DispatchObject)
*THIS\ObjectCount - 1
ProcedureReturn *THIS\ObjectCount
EndProcedure
Procedure GetTypeInfoCount(*THIS.DispatchObject, pctinfo)
ProcedureReturn #E_NOTIMPL
EndProcedure
Procedure GetTypeInfo(*THIS.DispatchObject, iTInfo, lcid, ppTInfo )
ProcedureReturn #E_NOTIMPL
EndProcedure
Procedure GetIDsOfNames(*THIS.DispatchObject, riid, rgszNames, cNames, lcid, rgDispId) : EndProcedure
Procedure.s StringFromVARIANT(*var.VARIANT)
If VariantChangeType_(*var, *var, $2, #VT_BSTR) = #S_OK
Result$ = PeekS(*var\bstrVal, -1, #PB_Unicode)
SysFreeString_(*var\bstrVal)
Else
Result$ = "ERROR : Cannot convert VARIANT to String!"
EndIf
ProcedureReturn Result$
EndProcedure
Global NewList dispatchObject.DispatchObject()
Procedure Invoke(*THIS.DispatchObject, dispIdMember, riid, lcid, wFlags, *pDispParams.DISPPARAMS, pVarResult, pExcepInfo, puArgErr)
Select dispIDMember
EndSelect
EndProcedure
AddElement(DispatchObject())
DispatchObject()\IDispatch = ?dispatchFunctions
;/////////////////////////////////////////////////////////////////////////////////
Structure _IDocHostUIHandler
*vTable
ref.i
iDocHostUiHandler.iDocHostUiHandler
EndStructure
Procedure.i SetCustomDocHostUIHandler(id, vTableAddress)
Protected result=#E_FAIL, hWnd, iBrowser.IWebBrowser2, iDispatch.IDispatch, iDocument.IHTMLDocument2, iOLE.IOleObject, iDocHostUIHandler.IDocHostUIHandler
Protected iCustomDoc.ICustomDoc, iOLEClientSite.IOleClientSite, *this._IDocHostUIHandler
hWnd = GadgetID(id)
If hWnd
iBrowser = GetWindowLong_(hWnd, #GWL_USERDATA)
If iBrowser
If iBrowser\get_Document(@iDispatch) = #S_OK
If iDispatch\QueryInterface(?IID_IHTMLDocument2, @iDocument) = #S_OK
If iDocument\QueryInterface(?IID_IOleObject, @iOLE) = #S_OK
If iOLE\GetClientSite(@iOLEClientSite) = #S_OK
If iOLEClientSite\QueryInterface(?IID_IDocHostUIHandler, @iDocHostUIHandler) = #S_OK
If iDocument\QueryInterface(?IID_ICustomDoc, @iCustomDoc) = #S_OK
*this = AllocateMemory(SizeOf(_IDocHostUIHandler))
If *this
*this\vTable = vTableAddress
*this\iDocHostUiHandler = iDocHostUIHandler
iCustomDoc\SetUIHandler(*this)
result = #S_OK
Else
iDocHostUIHandler\Release()
EndIf
iCustomDoc\Release()
Else
iDocHostUIHandler\Release()
EndIf
EndIf
IOleClientSite\Release()
EndIf
iOLE\Release()
EndIf
iDocument\Release()
EndIf
iDispatch\Release()
EndIf
EndIf
EndIf
ProcedureReturn result
EndProcedure
;/////////////////////////////////////////////////////////////////////////////////
;iUnknown.
Procedure.i IDocHostUIHandler_QueryInterface(*this._IDocHostUIHandler, riid, *ppObj.INTEGER)
Protected hResult = #E_NOINTERFACE, iunk.iUnknown
If *ppObj And riid
*ppObj\i = 0
If CompareMemory(riid, ?IID_IUnknown, SizeOf(IID)) Or CompareMemory(riid, ?IID_IDocHostUIHandler, SizeOf(IID))
*ppObj\i = *this
*this\ref+1
hResult = #S_OK
EndIf
EndIf
ProcedureReturn hResult
EndProcedure
;iUnknown.
Procedure.i IDocHostUIHandler_AddRef(*this._IDocHostUIHandler)
*this\ref = *this\ref + 1
ProcedureReturn *this\ref
EndProcedure
;iUnknown.
Procedure.i IDocHostUIHandler_Release(*this._IDocHostUIHandler)
Protected refCount
*this\ref = *this\ref - 1
refCount = *this\ref
If *this\ref = 0
*this\iDocHostUiHandler\Release()
FreeMemory(*this)
EndIf
ProcedureReturn refCount
EndProcedure
Procedure.i IDocHostUIHandler_ShowUI(*this._IDocHostUIHandler, dwID, pActiveObject, pCommandTarget, pFrame, pDoc)
ProcedureReturn *this\iDocHostUiHandler\ShowUI(dwID, pActiveObject, pCommandTarget, pFrame, pDoc)
EndProcedure
Procedure.i IDocHostUIHandler_HideUI(*this._IDocHostUIHandler)
ProcedureReturn *this\iDocHostUiHandler\HideUI()
EndProcedure
Procedure.i IDocHostUIHandler_FilterDataObject(*this._IDocHostUIHandler, pDO, ppDORet)
ProcedureReturn *this\iDocHostUiHandler\FilterDataObject(pDO, ppDORet)
EndProcedure
DataSection
IID_IOleObject: ; 00000112-0000-0000-C000-000000000046
Data.l $00000112
Data.w $0000, $0000
Data.b $C0, $00, $00, $00, $00, $00, $00, $46
IID_IDocHostUIHandler: ; BD3F23C0-D43E-11CF-893B-00AA00BDCE1A
Data.l $BD3F23C0
Data.w $D43E, $11CF
Data.b $89, $3B, $00, $AA, $00, $BD, $CE, $1A
IID_ICustomDoc: ; 3050F3F0-98B5-11CF-BB82-00AA00BDCE0B
Data.l $3050F3F0
Data.w $98B5, $11CF
Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B
IID_IHTMLDocument: ; {626FC520-A41E-11CF-A731-00A0C9082637}
Data.l $626FC520
Data.w $A41E, $11CF
Data.b $A7, $31, $00, $A0, $C9, $08, $26, $37
IID_NULL: ; {00000000-0000-0000-0000-000000000000}
Data.l $00000000
Data.w $0000, $0000
Data.b $00, $00, $00, $00, $00, $00, $00, $00
EndDataSection
;; -- end iDispatch interface functions. partly adapted from code by freak
Procedure.i WebHelpers_GetHTMLDocument2 (nGadget)
Protected oBrowser.IWebBrowser2 = GetWindowLongPtr_(GadgetID(nGadget), #GWL_USERDATA)
Protected oDocumentDispatch.IDispatch
Protected oHTMLDocument.IHTMLDocument2
Protected iBusy
Repeat
While WindowEvent(): Delay(0): Wend
oBrowser\get_Busy(@iBusy): Delay(10)
Until iBusy = #VARIANT_FALSE
If oBrowser
If oBrowser\get_document(@oDocumentDispatch) = #S_OK
If oDocumentDispatch\QueryInterface(?IID_IHTMLDocument2, @oHTMLDocument) = #S_OK
oDocumentDispatch\Release()
EndIf
EndIf
EndIf
ProcedureReturn oHTMLDocument
EndProcedure
Procedure.i WebHelpers_GetHTMLDocumentParent (nGadget)
  Protected oHTMLDocument.IHTMLDocument2 = WebHelpers_GetHTMLDocument2 (nGadget)
  Protected oWindow.IHTMLWindow2
  If oHTMLDocument
    oHTMLDocument\get_parentWindow(@oWindow)
    oHTMLDocument\Release() ; release only if the document was actually obtained
  EndIf
  ProcedureReturn oWindow
EndProcedure
Procedure WebHelpers_InvokeJS (nGadget, sScriptCode.s, sScriptLanguage.s = "JavaScript")
Protected oWindow.IHTMLWindow2 = WebHelpers_GetHTMLDocumentParent (nGadget)
Protected tVariant.VARIANT
If oWindow
oWindow\execScript ("0" , sScriptLanguage, @tVariant)
oWindow\execScript (sScriptCode, sScriptLanguage, @tVariant)
oWindow\Release()
EndIf
EndProcedure
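Procedure.s WebHelpers_GetURL(WGN.l)
  Protected WebObject.IWebBrowser2, *bstrURL, url$
  ; GetWindowLongPtr_ keeps the full pointer on x64 (GetWindowLong_ returns only 32 bits)
  WebObject = GetWindowLongPtr_(GadgetID(WGN), #GWL_USERDATA)
  If WebObject
    If WebObject\get_LocationURL(@*bstrURL) = #S_OK And *bstrURL
      url$ = PeekS(*bstrURL, -1, #PB_Unicode)
      SysFreeString_(*bstrURL) ; the caller owns the returned BSTR
    EndIf
  EndIf
  ProcedureReturn url$
EndProcedure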
Procedure.s WebHelpers_GetJSV(Gadget, Name$)
Result$ = "ERROR"
Browser.IWebBrowser2 = GetWindowLongPtr_(GadgetID(Gadget), #GWL_USERDATA) ; full pointer on x64
If Browser\get_Document(@DocumentDispatch.IDispatch) = #S_OK
If DocumentDispatch\QueryInterface(?IID_IHTMLDocument, @Document.IHTMLDocument) = #S_OK
If Document\get_Script(@Script.IDispatch) = #S_OK
bstr_name = MakeBSTR(Name$)
result = Script\GetIDsOfNames(?IID_NULL, @bstr_name, 1, 0, @dispID.l)
If result = #S_OK
params.DISPPARAMS\cArgs = 0
params\cNamedArgs = 0
result = Script\Invoke(dispID, ?IID_NULL, 0, #DISPATCH_PROPERTYGET, @params, @varResult.VARIANT, 0, 0)
If result = #S_OK
Result$ = StringFromVARIANT(@varResult)
Else
Message$ = Space(3000)
FormatMessage_(#FORMAT_MESSAGE_IGNORE_INSERTS|#FORMAT_MESSAGE_FROM_SYSTEM, 0, result, 0, @Message$, 3000, 0)
Result$ = "ERROR: Invoke() "+Message$
EndIf
Else
Message$ = Space(3000)
FormatMessage_(#FORMAT_MESSAGE_IGNORE_INSERTS|#FORMAT_MESSAGE_FROM_SYSTEM, 0, result, 0, @Message$, 3000, 0)
Result$ = "ERROR: GetIDsOfNames() "+Message$
EndIf
SysFreeString_(bstr_name)
Script\Release()
EndIf
Document\Release()
EndIf
DocumentDispatch\Release()
EndIf
ProcedureReturn Result$
EndProcedure
Procedure.s WebHelpers_GetInnerText(webgadget) ;;; JavaScript code compatible with IE11 and CEF
script2.s + "var jResult=document.documentElement.innerText; "
script2.s + "if (window.frames.length) { "
script2.s + " for (var xx = 0 ; xx < window.frames.length ; xx++) "
script2.s + " {jResult = jResult + '\n\n' + window.frames[xx].document.documentElement.innerText;}"
script2.s + "} "
script2.s + "jResult; "
WebHelpers_InvokeJS (webgadget, script2.s)
ProcedureReturn WebHelpers_GetJSV(webgadget,"jResult")
EndProcedure
DataSection
dispatchFunctions:
Data.i @QueryInterface(),@AddRef(),@Release(),@GetTypeInfoCount()
Data.i @GetTypeInfo(),@GetIDsOfNames(),@Invoke()
IID_IWebBrowser2:
Data.l $D30C1661
Data.w $CDAF, $11D0
Data.b $8A, $3E, $00, $C0, $4F, $C9, $E2, $6E
IID_IConnectionPointContainer:
Data.l $B196B284
Data.w $BAB4, $101A
Data.b $B6, $9C, $00, $AA, $00, $34, $1D, $07
IID_IDispatch:
Data.l $00020400
Data.w $0000, $0000
Data.b $C0, $00, $00, $00, $00, $00, $00, $46
IID_IUnknown:
Data.l $00000000
Data.w $0000, $0000
Data.b $C0, $00, $00, $00, $00, $00, $00, $46
IID_DWebBrowserEvents2:
Data.l $34A715A0
Data.w $6587, $11D0
Data.b $92, $4A, $00, $20, $AF, $C7, $AC, $4D
EndDataSection
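Procedure.s RemoveTagsFromString(texttoprocess$)
  Protected result$
  ; An invisible helper window hosts the WebGadget that renders the HTML
  If OpenWindow(99, 100, 100, 140, 80, "", #PB_Window_Invisible)
    WebGadget(0, 0, 0, 100, 100, "")
    SetGadgetItemText(0, #PB_Web_HtmlCode, texttoprocess$)
    result$ = WebHelpers_GetInnerText(0)
    CloseWindow(99) ; also frees the gadget, so repeated calls do not pile up windows
  EndIf
  ProcedureReturn result$
EndProcedure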
;;; demo ;;;
Debug RemoveTagsFromString("<B>Hello </B> <a href=https://www.google.com/>Click here</a>")