Best approach to create a syntax reader for a code converter?

skinkairewalker · Post by **skinkairewalker** » Wed Aug 13, 2025 5:07 am

Hi everyone,

I’m working on an academic project where I need to build a code converter between different programming languages.
The first step is to create a syntax reader to identify structures, variables, functions, types, etc.

My question is: what do you think is the most efficient approach for this kind of parsing?

Regex – fast for simple patterns but can become messy with more complex syntax.

Lexer + Parser – reading character by character, generating tokens, and then interpreting them via syntax analysis.

AST (Abstract Syntax Tree) – building a syntax tree to make conversion easier.

Other methods – maybe a library, technique, or algorithm you’ve used for parsing.

The goal is to make something accurate and scalable, capable of handling code blocks, comments, and multiple languages in the future.
This is part of a university research project, so the focus is not only on making it work, but also on exploring and comparing different approaches.

What techniques have you used for this kind of task?
Would you recommend starting simple with regex, or going straight to something more structured like a lexer/tokenizer?

If you have any practical code examples, they would be very welcome...

Thanks in advance!

SMaag · Post by **SMaag** » Wed Aug 13, 2025 9:15 am

My question is: what do you think is the most efficient approach for this kind of parsing?

Regex – fast for simple patterns but can become messy with more complex syntax.
Lexer + Parser – reading character by character, generating tokens, and then interpreting them via syntax analysis.
AST (Abstract Syntax Tree) – building a syntax tree to make conversion easier.
Other methods – maybe a library, technique, or algorithm you’ve used for parsing.

I guess you will need all of them!

If you really want to convert complex structures you need compiler techniques.

Here is a very good and simple description of how to compile

https://keleshev.com/compiling-to-assem ... m-scratch/

But for a university research project you should study the Oberon compiler techniques of Niklaus Wirth, ETH Zuerich.
He introduced new compiler techniques which are used today in modern programming languages like
Java, Go, Swift and Microsoft .NET. I do not really know the detailed differences from older classic compilers.
But Niklaus Wirth said about Sun Microsystems: after they studied the Oberon compiler they were able to build Java.
"Java is Oberon with a C syntax". But Java never reached the original (Oberon)!

To see what Oberon can do, test the BlackBox framework (BlackBox Component Builder, whose language is Component Pascal) - that's Oberon!

One important thing I remember: Niklaus Wirth said in one of his videos: "You will get exactly the character position where a syntax error is".
This is a big difference from a simple message like "Syntax error in line xy". That's exactly what you need for code converters.
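
A minimal Python sketch of what "exact character position" can mean in practice. It is not Wirth's code: the names `SyntaxErrorAt`, `check_brackets` and `show_error` are made up for illustration, and the check is deliberately tiny (bracket matching only, with no handling of strings or comments). The point is only that a scanner that counts lines and columns while reading can report the precise spot.

```python
class SyntaxErrorAt(Exception):
    """Hypothetical error type that carries an exact source position."""

    def __init__(self, message, line, column):
        super().__init__(f"{message} at line {line}, column {column}")
        self.line = line
        self.column = column


def check_brackets(source):
    """Read character by character and raise at the first bracket mismatch."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []                      # entries are (opening char, line, column)
    line, column = 1, 1
    for ch in source:
        if ch in "([{":
            stack.append((ch, line, column))
        elif ch in pairs:
            if not stack or stack[-1][0] != pairs[ch]:
                raise SyntaxErrorAt(f"unexpected '{ch}'", line, column)
            stack.pop()
        # Keep the position up to date for the next character.
        if ch == "\n":
            line, column = line + 1, 1
        else:
            column += 1
    if stack:
        ch, line, column = stack[-1]
        raise SyntaxErrorAt(f"unclosed '{ch}'", line, column)


def show_error(source, err):
    """Return the offending line with a caret under the reported column."""
    text = source.splitlines()[err.line - 1]
    return f"{text}\n{' ' * (err.column - 1)}^ {err}"
```

Catching `SyntaxErrorAt` and printing `show_error(source, err)` gives the source line plus a caret, which is far easier to act on in a converter than a bare line number.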

SMaag · Post by **SMaag** » Wed Aug 13, 2025 9:44 am

Here is the link to the Project Oberon book by Niklaus Wirth and Jürgen Gutknecht of ETH Zuerich.

https://people.inf.ethz.ch/wirth/ProjectOberon1992.pdf

The compiler description with code starts at page 266 (12. The Compiler).

AZJIO · Post by **AZJIO** » Wed Aug 13, 2025 2:58 pm

Basically you need to write a compiler, since you don't just parse the syntax, you also need to split it into code sections. I think it's better to use character by character reading for the syntax parser, so you can easily modify the parser behavior during the reading process.
You need to do what Fred does when he converts PureBasic code to ASM or C-Backend.
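
To illustrate the character-by-character idea, here is a small Python sketch of a scanner whose behaviour changes with its current state. The conventions are assumptions chosen to resemble PureBasic (`;` starts a comment that runs to the end of the line, `"` delimits strings); the function name `scan` and the token kinds are hypothetical.

```python
def scan(source):
    """Split source into (kind, text) pieces, reading one character at a time."""
    tokens = []
    state = "code"          # one of: "code", "string", "comment"
    buf = ""

    def flush(kind):
        # Emit the buffered text (if any) and start a new buffer.
        nonlocal buf
        text = buf if kind == "string" else buf.strip()
        if text:
            tokens.append((kind, text))
        buf = ""

    for ch in source:
        if state == "code":
            if ch == '"':
                flush("code")
                state = "string"        # from here on ';' is ordinary text
            elif ch == ";":
                flush("code")
                state = "comment"       # from here on '"' is ordinary text
            elif ch == "\n":
                flush("code")
            else:
                buf += ch
        elif state == "string":
            if ch == '"':
                tokens.append(("string", buf))   # empty strings are kept
                buf = ""
                state = "code"
            else:
                buf += ch
        else:  # state == "comment"
            if ch == "\n":
                flush("comment")
                state = "code"
            else:
                buf += ch
    flush(state)            # emit whatever is left at end of input
    return tokens
```

Because the state decides how each character is interpreted, a `;` inside a string is not mistaken for a comment. That is the kind of mid-read behaviour change that is hard to get right with a single regex.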

idle · Post by **idle** » Thu Aug 14, 2025 4:52 am

Knowing which languages are involved would be a good start.
There are numerous frameworks that can make it easier; check out the Frameworks section in this repo:
https://github.com/milahu/awesome-transpilers

SMaag · Post by **SMaag** » Thu Aug 14, 2025 9:54 am

Basically you need to write a compiler, since you don't just parse the syntax, you also need to split it into code sections. I think it's better to use character by character reading for the syntax parser, so you can easily modify the parser behavior during the reading process.
You need to do what Fred does when he converts PureBasic code to ASM or C-Backend.

exactly!

But remember: it is not possible to convert every language into every other. A good example is C++ classes. You can't convert them to PB or Pascal, because PB and Pascal do not support that kind of structure. And if you can convert them by using individual code structures, it is a one-way ticket: you can't convert it back. It is like compiling PB to ASM: you lose the original structure. It took decades to develop decompilers from machine code to C, like the Ghidra decompiler (Ghidra itself is written mostly in Java). https://github.com/NationalSecurityAgency/ghidra

So the first thing would be to research the language structures and find out what is easy to convert.
VB6 -> Pascal or Delphi is the easy direction, but Pascal -> VB6 is not possible in many cases. PureBasic -> Pascal is possible, but Pascal -> PureBasic is not!

So before thinking about the parser and converter, research the language structures.

To this day the consensus is that a "Universal Intermediate Language" does not exist and cannot exist!

Piero · Post by **Piero** » Thu Aug 14, 2025 10:17 am

SMaag wrote: Thu Aug 14, 2025 9:54 am But remember: it is not possible to convert every language into every other.

[sarc]Huh? There's AI for that![/sarc]

Edit/PS: and quantum computers

SMaag · Post by **SMaag** » Thu Aug 14, 2025 12:57 pm

Huh? There's AI for that!

Try to convert VB6 or Fortran to PureBasic with AI!
This will change your mind!

Piero · Post by **Piero** » Thu Aug 14, 2025 5:27 pm

SMaag wrote: Thu Aug 14, 2025 12:57 pm Try to convert VB6 or Fortran to PureBasic with AI!
This will change your mind!

I wouldn't be so sure; will it censor at least 'pineapple pizza' if I ask it in Italian?

Skipper · Post by **Skipper** » Fri Aug 15, 2025 12:25 pm

- first, you write a tokenizer that takes the source file and creates an orderly token stream array from it.
- then, you write a parser that takes the token stream array and converts it into an AST
- with this AST, you do semantic analysis, detect syntax errors, do optimisations, etc. The output is a revised/corrected AST.
- then you write a target language code emitter, that takes the corrected AST from the previous step and emits source code for your target language.
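
The four steps above can be sketched end to end in Python for a toy source language. Everything here is an assumption made for illustration: the source language is a single assignment statement with `+ - * /` and parentheses, the AST is made of plain tuples, the "semantic analysis" is only an undeclared-variable check, and the target is Pascal-like text. A real converter needs far more in each step.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Step 1: source text -> list of (kind, value) tokens."""
    tokens = []
    for number, name, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("num", number))
        elif name:
            tokens.append(("name", name))
        elif op.strip():                 # ignore stray whitespace
            tokens.append(("op", op))
    tokens.append(("eof", ""))
    return tokens


class Parser:
    """Step 2: recursive descent parser, token list -> tuple-based AST."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r} (token {self.pos})")
        self.pos += 1
        return tok

    def statement(self):                 # statement := name '=' expr
        target = self.take("name")[1]
        self.take("op", "=")
        value = self.expr()
        self.take("eof")
        return ("assign", target, value)

    def expr(self):                      # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            op = self.take("op")[1]
            node = ("binop", op, node, self.term())
        return node

    def term(self):                      # term := atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek() in (("op", "*"), ("op", "/")):
            op = self.take("op")[1]
            node = ("binop", op, node, self.atom())
        return node

    def atom(self):                      # atom := number | name | '(' expr ')'
        kind, value = self.peek()
        if kind in ("num", "name"):
            self.pos += 1
            return (kind, value)
        self.take("op", "(")
        node = self.expr()
        self.take("op", ")")
        return node


def check(node, known):
    """Step 3: minimal semantic check, every name read must already be known."""
    if node[0] == "name" and node[1] not in known:
        raise NameError(f"undeclared variable {node[1]!r}")
    if node[0] == "binop":
        check(node[2], known)
        check(node[3], known)
    if node[0] == "assign":
        check(node[2], known)
        known.add(node[1])


def emit(node):
    """Step 4: AST -> Pascal-like target text (fully parenthesised)."""
    if node[0] == "assign":
        return f"{node[1]} := {emit(node[2])};"
    if node[0] == "binop":
        return f"({emit(node[2])} {node[1]} {emit(node[3])})"
    return node[1]


def convert(line, known):
    ast = Parser(tokenize(line)).statement()
    check(ast, known)
    return emit(ast)
```

For example, `convert("x = 1 + 2 * y", {"y"})` returns `x := (1 + (2 * y));`. The emitter never looks at the source text, only at the AST, which is what makes it possible to swap in a different emitter later.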

Quite an undertaking - good luck with your project...

Skipper

skinkairewalker · Post by **skinkairewalker** » Sat Aug 16, 2025 5:17 am

awesome guys

I am very grateful for everyone sharing their experiences, I will use them all, thank you very much!!

#NULL · Post by **#NULL** » Sat Aug 16, 2025 1:31 pm

You might want to look into this as a starting point for a tokenizer:
Lexer for PB 4: viewtopic.php?t=22116
It's specifically for PB code, so not general.
In the German forum thread mentioned there, there is a more up-to-date version.

Sicro · Post by **Sicro** » Sat Aug 16, 2025 4:23 pm

For my parsing tasks, I always use a lexer/parser combination. For the lexer, I use my own RegEx engine (see my post signature below, look also in the example directory on the project page), which is very flexible, does not require backtracking, and generates a very fast DFA. For the parser, I always write a recursive descent parser (see my post signature below, look in the parser directory on the CodeArchiv project page), because this allows it to react flexibly to different situations.

If you also perform lexing in the parser (scannerless parser), the parser can become quite complex. With a separate lexer, some of the complexity can be outsourced from the parser, which can also reduce the amount of backtracking required by the parser, making it faster. For context-sensitive lexing, several separate lexers can be used, between which the parser can switch during processing.
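
A rough Python sketch of the "several lexers, parser switches between them" idea. This is not Sicro's engine: it uses Python's `re` module, and the toy syntax (`name = number;` statements plus `asm { ... }` blocks whose content is passed through untouched) is invented for the example, as are the names `LEXERS`, `expect` and `parse`.

```python
import re

LEXERS = {
    # Default lexer: identifiers, integers and a few punctuation characters.
    "code": re.compile(r"\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[{}=;]))"),
    # Raw lexer: everything up to the next closing brace is one opaque token.
    "raw": re.compile(r"(?P<raw>[^}]*)"),
}


def expect(text, pos, mode, kind, value=None):
    """Read one token with the lexer chosen by `mode`; return (value, new_pos)."""
    match = LEXERS[mode].match(text, pos)
    if not match:
        raise SyntaxError(f"no {mode} token at offset {pos}")
    got_kind, got = match.lastgroup, match.group(match.lastgroup)
    if got_kind != kind or (value is not None and got != value):
        raise SyntaxError(f"expected {value or kind} at offset {pos}, got {got!r}")
    return got, match.end()


def parse(text):
    """Parse the toy syntax; the parser decides which lexer reads the next token."""
    text = text.rstrip()
    pos, items = 0, []
    while pos < len(text):
        name, pos = expect(text, pos, "code", "name")
        if name == "asm":
            _, pos = expect(text, pos, "code", "punct", "{")
            body, pos = expect(text, pos, "raw", "raw")        # switch to the raw lexer
            _, pos = expect(text, pos, "code", "punct", "}")   # and back again
            items.append(("asm", body.strip()))
        else:
            _, pos = expect(text, pos, "code", "punct", "=")
            num, pos = expect(text, pos, "code", "num")
            _, pos = expect(text, pos, "code", "punct", ";")
            items.append(("assign", name, int(num)))
    return items
```

The text inside the braces may contain characters such as `,` that the default lexer would reject. Since only the parser knows it is inside an `asm` block, it is the parser that picks the lexer.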

If you want to translate from any programming language to any other programming language, I think it would be a good idea to come up with a unified intermediate language or abstract syntax tree and then translate each programming language into this form first. This way, in the next step, you can always translate from the unified form into the target programming language without having to keep the source programming language in mind.
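
A toy Python sketch of that architecture, assuming a single hypothetical node type `VarDecl` and made-up one-line syntaxes for the "BASIC" and "C" inputs. With a shared form, N source languages and M target languages need N frontends plus M backends instead of N x M direct translators. Whether one shared form can cover all features of real languages is a separate question that this sketch does not address.

```python
import re
from dataclasses import dataclass


@dataclass
class VarDecl:
    """Hypothetical language-neutral node: an integer variable with an initial value."""
    name: str
    value: int


# Frontends: source text -> shared node.
def from_basic(line):
    """Assumed toy syntax: Dim x = 5"""
    m = re.fullmatch(r"\s*Dim\s+(\w+)\s*=\s*(\d+)\s*", line)
    if not m:
        raise ValueError(f"not a BASIC declaration: {line!r}")
    return VarDecl(m.group(1), int(m.group(2)))


def from_c(line):
    """Assumed toy syntax: int x = 5;"""
    m = re.fullmatch(r"\s*int\s+(\w+)\s*=\s*(\d+)\s*;\s*", line)
    if not m:
        raise ValueError(f"not a C declaration: {line!r}")
    return VarDecl(m.group(1), int(m.group(2)))


# Backends: shared node -> target text. They never see the source language.
def to_pascal(node):
    return f"var {node.name}: Integer = {node.value};"


def to_c(node):
    return f"int {node.name} = {node.value};"


FRONTENDS = {"basic": from_basic, "c": from_c}
BACKENDS = {"pascal": to_pascal, "c": to_c}


def translate(line, source, target):
    """Any registered source language to any registered target language."""
    return BACKENDS[target](FRONTENDS[source](line))
```

Adding a third target language here means writing one new backend function and registering it; the existing frontends stay untouched.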

You should also ask yourself whether the translated code should still be easy for the programmer to read and understand.

It is also important to bear in mind that the translated code will probably not take advantage of all the benefits of the target programming language; most of the time it will use only the basic features.

skinkairewalker · Post by **skinkairewalker** » Thu Sep 25, 2025 11:58 pm

Sicro wrote: Sat Aug 16, 2025 4:23 pm For my parsing tasks, I always use a lexer/parser combination. For the lexer, I use my own RegEx engine (see my post signature below, look also in the example directory on the project page), which is very flexible, does not require backtracking, and generates a very fast DFA. For the parser, I always write a recursive descent parser (see my post signature below, look in the parser directory on the CodeArchiv project page), because this allows it to react flexibly to different situations.

If you also perform lexing in the parser (scannerless parser), the parser can become quite complex. With a separate lexer, some of the complexity can be outsourced from the parser, which can also reduce the amount of backtracking required by the parser, making it faster. For context-sensitive lexing, several separate lexers can be used, between which the parser can switch during processing.

If you want to translate from any programming language to any other programming language, I think it would be a good idea to come up with a unified intermediate language or abstract syntax tree and then translate each programming language into this form first. This way, in the next step, you can always translate from the unified form into the target programming language without having to keep the source programming language in mind.

You should also ask yourself whether the translated code should still be easy for the programmer to read and understand.

It is also important to bear in mind that the translated code will probably not take advantage of all the benefits of the target programming language; most of the time it will use only the basic features.

Your regex engine is insane, it will definitely be of extraordinary help!!!
I'm currently having a lot of problems with the parser logic. I had help from Skipper with the tokenizer and I managed to replicate it, but the parser and lexer analyzer part is the worst xD

skywalk · Post by **skywalk** » Fri Sep 26, 2025 12:46 am

You can take a look at how SQLite builds its SQL parser with the Lemon parser generator.
There is a grammar file that defines the language to parse. The output is the parser as C code.

I wrote a VB6 -> PB converter many years ago when I switched to PB.
It was a great way to learn PB.
I opted out of regex, and used custom string functions for 95% of the translation.
The 5% got special comments so I could manually decide how to proceed in PB.
Ex.
ZZ;FIX; Some translation conflict code here
I also wrote a C header -> PB converter in similar fashion.
This caught most simple headers; conflicts and conditional compiler switches had to be manually edited.
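
A small Python sketch of that style of converter: plain string functions, a handful of rules, and a marker on every line the rules cannot handle. It is not skywalk's converter. The rules shown (comments, three `Dim ... As` types, block `If`/`End If`) are assumptions picked for the example, and only the `ZZ;FIX;` marker is taken from the post above.

```python
# Assumed mapping from VB6 type names to PureBasic type suffixes.
TYPE_SUFFIX = {"Long": "l", "String": "s", "Double": "d"}

MARKER = "ZZ;FIX; "     # flags lines that need a manual decision


def translate_line(line):
    """Translate one VB6 line to PureBasic, or mark it for manual review."""
    stripped = line.strip()
    if not stripped:
        return ""
    if stripped.startswith("'"):                       # ' comment  ->  ; comment
        return ";" + stripped[1:]
    words = stripped.split()
    if (len(words) == 4 and words[0] == "Dim" and words[2] == "As"
            and words[3] in TYPE_SUFFIX):              # Dim x As Long -> Define x.l
        return f"Define {words[1]}.{TYPE_SUFFIX[words[3]]}"
    if words[0] == "If" and words[-1] == "Then":       # block If: drop the trailing Then
        return stripped[:-len("Then")].rstrip()
    if stripped == "End If":
        return "EndIf"
    return MARKER + stripped                           # no rule matched


def translate(source):
    """Translate a whole VB6 snippet line by line."""
    return "\n".join(translate_line(line) for line in source.splitlines())
```

Searching the output for `ZZ;FIX;` then lists exactly the places that still need a human decision, and each new rule shrinks that list. Note that the condition of an `If` is passed through unchanged here, so anything VB6-specific inside it would still need its own rule.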

PureBasic Forums - English
