Posted: Wed Oct 30, 2002 4:14 pm
by BackupUser
Restored from previous forum. Originally posted by Art Sentinel.

Hi everyone,

OK. I am trying to figure out the fastest way to read an HTML file (any file the user specifies; this will be dynamic, and the file could be on the hard drive or on a remote server). Next, I want to parse the file and separate the words within its different sections into their own logical groups (the meta tags, the body, the comments, the alt text, etc.). Finally, I want to ignore the remaining HTML tags, count the words in each section, and write another HTML file that lists the percentage results in a convenient table.

Most of this is not hard at all. Writing the HTML is a breeze. Displaying the written HTML file within the internal web gadget is simple. The trouble is that I cannot yet find a FAST and reliable way to parse the file and count the words.

A friend of mine suggested adding the words to a database. This is fine, but I don't like the idea of having such dependencies for the application. A quick, tiny, no-nonsense app is the best solution.

What can you think of to do this within PureBasic alone?

Any help you can give me will be greatly appreciated. :) Thank you!

Take care.


- Art Sentinel
http://www.artsentinel.net

.

.

.

By the way, HAPPY HALLOWEEN!

.

"Resistance is futile! Surrender your treats or be assimilated!"



--------------

Top Ten Reasons Not To Procrastinate:


Coming Soon...

Posted: Wed Oct 30, 2002 6:08 pm
by BackupUser
Restored from previous forum. Originally posted by fweil.

Hello,

I'm not sure I will find the fastest way, but as I'm used to parsing and building HTML pages with PureBasic, I won't let you down ...

Maybe tomorrow I will post a first rough outline.

KRgrds

Francois Weil
14, rue Douer
F64100 Bayonne

Posted: Wed Oct 30, 2002 10:02 pm
by BackupUser
Restored from previous forum. Originally posted by cor.

I am also looking for a fast way to parse a file and replace the following with some code.

[%BACKGROUNDCOLOR%] must be replaced by some predefined or generated code.



[%BACKGROUNDCOLOR%]
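
For a single placeholder like this, a minimal sketch with the built-in ReplaceString() might look as follows. The template line, the variable names and the colour value are made up for illustration, and the syntax assumes a reasonably recent PureBasic version.

```purebasic
; Hypothetical template line and replacement value, for illustration only.
Template$ = "<body bgcolor=" + Chr(34) + "[%BACKGROUNDCOLOR%]" + Chr(34) + ">"
NewColor$ = "#FFFFCC"

; ReplaceString() returns a copy of the string with every occurrence replaced.
Result$ = ReplaceString(Template$, "[%BACKGROUNDCOLOR%]", NewColor$)

Debug Result$
```

Whether this is fast enough depends on how many placeholders and how much text there is; for a handful of tags per page it is probably the simplest option.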



Using Windows 98 SE
Registered PB version : 3.40 (Windows)
--------------------------
C. de Visser
Author of Super Guitar Chord Finder
http://www.ready4music.com

Posted: Thu Oct 31, 2002 7:05 am
by BackupUser
Restored from previous forum. Originally posted by fweil.

...,

I have a website generator, for my own use and for my customers' sites, which uses two techniques:

- The first one builds the pages initially from a descriptor file whose tags are associated with template files. Those tags are delimited by leading and trailing $$$ (not because it makes a lot of money, but ... !).

In the descriptor file there is the description of the files to create and the 'variables' to use, each with its value (i.e. $$$text$$$ Mytext). In the template files you find the positions of the $$$...$$$ variables within an HTML template (where JS code can also be placed, for example).

The constructor program can then combine the templates and the variable values to build the pages as files.

- The second use is, of course, to manage content. When building a page, I replace a variable with its value, not forgetting to wrap it in a valid HTML tag (i.e. MyText inside a tag). It is then easy to find the text and replace it with another one through a simple user interface, which makes content management easy.

I recently added a new feature using a variable named $$$File$$$, so that it is possible to first define:

$$$File$$$ path/filename
$$$Text$$$ $$$File$$$

in the descriptor file and then, when building the page, to replace $$$Text$$$ with the content of the file given by $$$File$$$. This makes it possible, for example, to place CSV files in a text area of a page. One more trick: the data file named in $$$File$$$ contains its own internal template, so that it fits into the HTML of the page where it is to be placed.

So now I can quickly build websites with fixed content, and also make quick updates both to the static parts and to dynamic content that comes from data files.
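
The variable substitution described above could be sketched roughly like this. This is not the generator's actual code: the variable names, the values and the template string are hypothetical, and the sketch assumes a recent PureBasic version that includes the Map library.

```purebasic
; Sketch only: these variables and values stand in for a real descriptor file.
NewMap Vars.s()
Vars("$$$Title$$$") = "My site"
Vars("$$$Text$$$")  = "Welcome to my page."

; Hypothetical template containing the $$$...$$$ variables.
Page$ = "<h1>$$$Title$$$</h1><p>$$$Text$$$</p>"

; Replace each variable found in the template with its value.
ForEach Vars()
  Page$ = ReplaceString(Page$, MapKey(Vars()), Vars())
Next

Debug Page$
```

A $$$File$$$ style variable would presumably be handled the same way, with the map value filled from the content of the named file before the replacement loop runs.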

As Art Sentinel's question is not that deep, I will probably come back to it for Halloween.

Rgrds

Francois Weil

Posted: Thu Oct 31, 2002 12:11 pm
by BackupUser
Restored from previous forum. Originally posted by fweil.

Hello,

So now I think I am able to parse a text file like the King James Holy Bible and find 856121 words, 14358 unique words, in 100117 lines, in 40 seconds. I used the plain-text version.

I was trying to find a big text which could be used as a reference. http://www.zuppa.co.uk/religion/bible/ gives two versions of this text, one in plain text and the other as HTML pages.

I also tested http://www.holybiblecentral.org/holybiblekjv.zip, using holybiblekjv.txt: 853725 words, 15094 unique words, in 31347 lines, parsed in 27 seconds.

Using the HTML version of this text (split into 66 pages), I found 1020437 words, 14478 unique words, in 114711 lines, in 44 seconds. My code does not yet detect tags or anything else HTML-specific when parsing the HTML files.

I don't know if this is a good way to start benchmarking PureBasic code ...

I have a 1.2 GHz processor running W2K with 192 MB of RAM.

I first tested code using a linked list but found it too slow, so I decided to use string arrays. I also tested StringField(), which seems to be very fast.

To detect words I use several ReplaceString() calls, which could maybe be replaced by a faster ASM procedure, though I am not sure I could code that myself.

The result of the first parsing pass is stored in a table that is sorted with SortArray(). The sorted result is then copied into a string array containing the unique words and an integer array containing the count for each word.
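
The sort-then-collapse step can be sketched as below. This is only an illustration of the idea, not the code used for the timings above: the tiny hard-coded word table and all names are hypothetical, and the syntax assumes a recent PureBasic version.

```purebasic
; Sketch only: a tiny hard-coded word table stands in for the parsed file.
Dim Words.s(5)
Words(0) = "the" : Words(1) = "and" : Words(2) = "the"
Words(3) = "god" : Words(4) = "and" : Words(5) = "the"

SortArray(Words(), #PB_Sort_Ascending)

; After sorting, equal words are adjacent, so a single pass can collapse them.
Dim Unique.s(5)
Dim Counts.l(5)
Unique(0) = Words(0)
Counts(0) = 1
n = 0
For i = 1 To 5
  If Words(i) = Unique(n)
    Counts(n) + 1          ; same word as before: bump its count
  Else
    n + 1                  ; new word: start a new entry
    Unique(n) = Words(i)
    Counts(n) = 1
  EndIf
Next

For i = 0 To n
  Debug Unique(i) + ", " + Str(Counts(i))
Next
```

The appeal of this layout is that the expensive part is one sort of the whole table; the unique-word extraction afterwards is a single linear pass.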

I don't know if anyone started the game, but I find my results rather impressive ...

Want to play ?

So if you think the files I used for benchmarking are suitable, maybe we can go on with them, or let me know if you have any other idea.

I will now work on the HTML parsing before posting any source code. But if you are interested in my first version, tell me.

KRgrds

Francois Weil

Posted: Thu Oct 31, 2002 5:10 pm
by BackupUser
Restored from previous forum. Originally posted by Franco.

Impressive results Francois.

Have a nice day...

Franco

Posted: Thu Oct 31, 2002 8:10 pm
by BackupUser
Restored from previous forum. Originally posted by Pupil.

Managed to count the words in the King James Holy Bible (plain-text version) in about 15 seconds on an AMD Duron 800 with 384 MB. I didn't get the same count as fweil though, probably because of differences in which characters we discard as unwordy :wink:
Anyway, I counted 100117 lines and 809825 words, of which 13749 were unique.

Oh, and I didn't sort the unique words, I just saved them as they appeared in the list.

Posted: Fri Nov 01, 2002 2:06 am
by BackupUser
Restored from previous forum. Originally posted by Fangbeast.
Originally posted by cor

I am also looking for a fast way to parse a file and replace the following with some code.

[%BACKGROUNDCOLOR%] must be replaced by some predefined or generated code.



[%BACKGROUNDCOLOR%]

Cor (and others), I don't know if this helps or not, but as I already told Art, I embed my HTML code as data statements in my software (because my base HTML form never changes), and this is very fast to read compared with an external file (there are just a few rules to follow).

Also, someone uploaded a very fast ReadString() replacement to Paul's site that can read in an entire file to memory that might help you as well.

Fangles woz ear orright den?

Posted: Fri Nov 01, 2002 9:48 am
by BackupUser
Restored from previous forum. Originally posted by cor.

Fangbeast,

file to memory snippet works great

Thanks




Posted: Fri Nov 01, 2002 12:17 pm
by BackupUser
Restored from previous forum. Originally posted by Fangbeast.
Originally posted by cor

Fangbeast,

file to memory snippet works great

Thanks

:):) (If I keep this up, I might rate another star soon) :)

Fangles woz ear orright den?

Posted: Fri Nov 01, 2002 1:43 pm
by BackupUser
Restored from previous forum. Originally posted by fweil.

...,

Now I am down to 15 seconds for parsing the whole King James text file, with the unique-words table sorted and each word's count computed.

I replaced the series of ReplaceString() calls and the loop of StringField() instructions with hand-coded algorithms which go faster (I mean processing each line in one step instead of building a 'list of' and looping).
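
A one-step scan of this kind might look roughly like the sketch below. It is not the code behind the timing above, which was never posted; the sample string, the names, and the choice of A-Z and a-z as the only word characters are assumptions for illustration, and the syntax assumes a recent PureBasic version.

```purebasic
; Sketch only: Source$ is a made-up sample line.
Source$ = "In the beginning God created the heaven and the earth."
Word$ = ""

For i = 1 To Len(Source$)
  c = Asc(Mid(Source$, i, 1))
  If (c >= 'A' And c <= 'Z') Or (c >= 'a' And c <= 'z')
    Word$ + Chr(c)          ; still inside a word
  ElseIf Word$ <> ""
    Debug LCase(Word$)      ; a word just ended
    Word$ = ""
  EndIf
Next

If Word$ <> ""              ; the line may end in the middle of a word
  Debug LCase(Word$)
EndIf
```

Each character is visited once, so the cost no longer grows with the number of separator characters that would otherwise each need their own ReplaceString() pass.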

Waow !

Francois Weil

Posted: Fri Nov 01, 2002 2:44 pm
by BackupUser
Restored from previous forum. Originally posted by Pupil.

I now parse the King James text file in approx. 5 seconds. The word table is not really sorted, but adding that will not increase the time by many milliseconds, I think; other than that the features don't differ from fweil's.
I now load the entire file into memory and process it there. I don't think I have many more areas to tweak now, except going for pure ASM or inventing some nifty new algorithm.

Test made on the same processor as stated in my previous post...

Posted: Fri Nov 01, 2002 4:24 pm
by BackupUser
Restored from previous forum. Originally posted by fweil.

Pupil,

Just to save me some brain resources, would you mind showing the lines where you load the file into memory ... or giving a link to the appropriate tips and tricks?

I am lazy today !

KRgrds

Francois Weil

Posted: Fri Nov 01, 2002 5:02 pm
by BackupUser
Restored from previous forum. Originally posted by Pupil.
Originally posted by fweil

Pupil,

Just to save me some brain resources, would you mind showing the lines where you load the file into memory ... or giving a link to the appropriate tips and tricks?

I am lazy today !
;)
As requested, I'll post my algorithm here, for all to enjoy and improve...

Code: Select all

Structure Chartype
	StructureUnion
		c.b
		d.b
		w.w
	EndStructureUnion
EndStructure

Structure UniqueWordsType
	Word.s
	Count.l
EndStructure

Global UniqueWords.l, TotalWords, TotalLines.l
Dim hash.l(65536)
Dim charhash.b(255)
NewList WordCount.UniqueWordsType()

Declare ParseString(*ptr.CharType)
Declare SortWords()


; create hash for valid characters
For i = 0 To 255
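	If (i >= 'A' And i <= 'Z') Or (i >= 'a' And i <= 'z')
		charhash(i) = #TRUE ; reconstructed: mark letters as valid word characters
	EndIf
Next

; The original lines between the loop above and the test below were lost when
; the post was restored; how filename.s was obtained is not known.
If filename <> ""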
	startdate.l = Date()
	If ReadFile(0, filename)
		length.l = Lof()
		*buffer = AllocateMemory(0, length+2)
		If *buffer
			UniqueWords = 0 : TotalWords = 0 : TotalLines = 0
			ReadData(*buffer, length)
			ParseString(*buffer)
			FreeMemory(0)
		EndIf
		CloseFile(0)
	EndIf
	
	ResetList(WordCount())
	Dim array.s(UniqueWords)
	i = 0
	While NextElement(WordCount())
		array(i) = WordCount()\Word+", "+Str(WordCount()\Count)
		i+1
	Wend
	SortArray(array(), 0)
	
	If CreateFile(1, "uniqueword.txt")
		WriteStringN("Lines: "+Str(TotalLines))
		WriteStringN("Words: "+Str(TotalWords))
		WriteStringN("Unique words: "+Str(UniqueWords))

		For i = 0 To UniqueWords
			WriteStringN(array(i))
		Next
		CloseFile(1)
	EndIf
	enddate.l = Date()
	msg.s = "Found "+Str(UniqueWords)+" unique words out of "+Str(TotalWords)+" words,"+Chr(10)
	msg + "in "+Str(TotalLines)+" lines."+Chr(10)
	msg + "Started:"+FormatDate("%yy/%mm/%dd, %hh:%ii:%ss", startdate)+Chr(10)
	msg + "Stopped:"+FormatDate("%yy/%mm/%dd, %hh:%ii:%ss", enddate)
	MessageRequester("Info", msg, 0)
Else
	MessageRequester("Error", "No file selected.", 0)
EndIf

End

Procedure ParseString(*ptr.CharType)
DefType.l start, quit
DefType.s word
DefType.CharType *p, *d
	start = *ptr
	While *ptr\c <> 0
		If charhash(*ptr\c)
			start = *ptr
			Repeat
				*ptr+1
			Until charhash(*ptr\c) = #FALSE Or *ptr\c = 0
			word = LCase(PeekS(start, *ptr-start))
			If word <> ""
				*p = @word
				TotalWords+1
				If hash(*p\w)
					ChangeCurrentElement(WordCount(), hash(*p\w))
					quit = #FALSE
					Repeat
						*d = @WordCount()\Word
						If WordCount()\Word
						; [the remainder of the listing was lost when the post was restored]

Edited: Changed source to sort output words for better overview...

Posted: Fri Nov 01, 2002 5:16 pm
by BackupUser
Restored from previous forum. Originally posted by Pupil.

Hey fweil, can you tell me what kind of time you get with the above code? My source doesn't report as many unique words as yours, but I haven't got your version, so I couldn't check what the differences are. Perhaps you can see whether my source contains errors that might explain this?