Parsers

blueznl · Post by **blueznl** » Fri Dec 24, 2010 1:21 am

I know how to parse something slowly, I'm just wondering how to parse something fast

As for an example, what would be the smartest / fastest way to parse a few lines, for example a little piece of PureBasic code, let's say I would be writing an interpreter processing the following:

Code: Select all

OpenWindow(1,10,10,100,100,''TEST'',#PB_Window_ScreenCentered|#PB_Window_SystemMenu)
a = 0
a = a+1
Debug a

Suggestions welcome

IdeasVacuum · Post by **IdeasVacuum** » Fri Dec 24, 2010 1:39 am

... depends on the purpose, no? If you were parsing a PB source code to create, say, a C/C++ source code, you'd need at least a double- barrel approach I think - a 'dictionary' of keywords which are a straight swap, and some intelligent functions, for each specific bespoke code method that is not shared by the two languages. Or have I misinterpreted your requirement?

blueznl · Post by **blueznl** » Fri Dec 24, 2010 2:10 am

Hmm. First step is probably to discern between commands and expressions etcetera.

Code: Select all

a = a+1
OpenWindow(...)
StartDrawing

So, how to discern between those variations?

Little John · Post by **Little John** » Fri Dec 24, 2010 3:25 am

Hi blueznl,

does this code by srod do what you want (see also modification at the end of that page) ?

Regards, Little John

blueznl · Post by **blueznl** » Fri Dec 24, 2010 9:11 am

In a way, yes, but wouldn't it be faster to first discern between expressions and commands to speed up the process? I mean, what is the fastest way to see what is an expression and what is not?

Code: Select all

expressions...

a = a+1

commands (with or without parenthesis)...

StartDrawing
OpenWindow(...)
a = OpenWindow(...)

procedures...

MyProcedure(..)

For reasons I have to do this almost realtime, I can't parse all code then sit back and relax, unfortunately.

Edit: well, perhaps speed might not as much the issue as I thought, I need to rethink my strategy...

Trond · Post by **Trond** » Fri Dec 24, 2010 3:22 pm

I'd like to know how you would do it slowly, so I know what the reference range for slow and fast is.
Normaly the slow way is recursive decent parsing, which is easy to write by hand. The fast way is to first provide a lexer (very simple parser that splits up the code into keywords, numbers, names, and so on) and parse the output of the lexer with a bottom-up parser like LR (see bottom-up parsing and lr parser on wikipedia).

tranquilChaos · Post by **tranquilChaos** » Sat Mar 12, 2011 4:30 pm

To get an understanding how to most efficiently do this, I would suggest you read-up on grammars - the most commonly discussed is Context Free Grammar (CFG). This will show you how to break your code into trees that can be parsed based on the grammar constructs.

To get started, I would say start smaller for now. Drop the OpenWindow(param, param, param) function call. Focus on an interpreter that will allow you to declare a variable and assign to it. It's kind of confusing but in execution, a function exists in the stack like a variable. A variable is a pointer to a memory location that holds data, a function is a pointer to a memory location that holds data (code to be evaluated). So just keep it simple for now.

There will be two components of the simple interpreter: tokens and symbols. Tokens are the operations, like = or +. You will need to build a list of tokens to operate on. You can put them in an array or linked list. The next part, the symbols, are anything the user defines, like variables; they are symbolic of a value. You will likely want to store them in a hash map so they can be quickly located.

So your first step (after you have created your token table and reserved space for symbols) will be to read in the line to parse. From your example:

a = 0

We want to break-up the tokens and symbols. This is called lexxing. This particular example is easy, we can split the string on spaces. The result should be an array with elements like this:

[a][=][0]

The next step is called parsing. You will evaluate each element and act on it in a way that makes sense for your interpreter. In this case, read in the first element and scan the token list. You have no token named 'a' so you know it's a symbol. Add an entry in the hash map for the symbol 'a'. You will need to do a check to see if the entry already exists. The next element IS in the token table and the code should know it's an assignment operator and it sets a flag that the next element is the value assigned to the symbol 'a'. Using this approach you could then add a simple function called 'PRINT' to your token table. Then if you add:

PRINT a

Using the same technique as above, it's lexxed as [PRINT][a], then the parser finds the token 'PRINT' and knows the next element is a symbol. It grabs 'a' from the symbol hash map and operates on it (in this case, printing the value of a to the screen).

Does this help? I've written a small C compiler in C++ for college about a year ago. I used linked list for my token look-ups and a balanced binary tree for my symbol table. It was really fast. I'm pretty new to PB so some of the terminology might be off, but the approach should be the same. I'm working on small projects to learn it right now, but I could put together a simple interpreter like this and share it if you're still interested.

inc. · Post by **inc.** » Mon Mar 14, 2011 11:20 am

Just have a look at Remi Meiers "Lexer" which does a superior job on tokenizing PB Sources, even '#' Marks within Macros are handled correctly.

PureBasic Forums - English

Parsers

Parsers

Re: Parsers

Re: Parsers

Re: Parsers

Re: Parsers

Re: Parsers

Re: Parsers

Re: Parsers