Best approach to create a syntax reader for a code converter?

skinkairewalker
Enthusiast
Posts: 778
Joined: Fri Dec 04, 2015 9:26 pm

Post by skinkairewalker »

Hi everyone,

I’m working on an academic project where I need to build a code converter between different programming languages.
The first step is to create a syntax reader to identify structures, variables, functions, types, etc.

My question is: what do you think is the most efficient approach for this kind of parsing?
  • Regex – fast for simple patterns but can become messy with more complex syntax.
  • Lexer + Parser – reading character by character, generating tokens, and then interpreting them via syntax analysis.
  • AST (Abstract Syntax Tree) – building a syntax tree to make conversion easier.
  • Other methods – maybe a library, technique, or algorithm you’ve used for parsing.
The goal is to make something accurate and scalable, capable of handling code blocks, comments, and multiple languages in the future.
This is part of a university research project, so the focus is not only on making it work, but also on exploring and comparing different approaches.

What techniques have you used for this kind of task?
Would you recommend starting simple with regex, or going straight to something more structured like a lexer/tokenizer?

If you have any practical code examples, they would be very welcome...

Thanks in advance!
SMaag
Enthusiast
Posts: 316
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Post by SMaag »

skinkairewalker wrote: My question is: what do you think is the most efficient approach for this kind of parsing?
  • Regex – fast for simple patterns but can become messy with more complex syntax.
  • Lexer + Parser – reading character by character, generating tokens, and then interpreting them via syntax analysis.
  • AST (Abstract Syntax Tree) – building a syntax tree to make conversion easier.
  • Other methods – maybe a library, technique, or algorithm you’ve used for parsing.

I guess you will need all of them!

If you really want to convert complex structures you need compiler techniques.

Here is a very good and simple description of how to compile

https://keleshev.com/compiling-to-assem ... m-scratch/


But for university research projects, you should study the Oberon compiler techniques of Niklaus Wirth (ETH Zürich).
He introduced new compiler techniques that are used today in all modern programming languages, such as
Java, Go, Swift, and Microsoft .NET. I do not really know the detailed differences from older classic compilers.
But Niklaus Wirth said about Sun Microsystems that after they studied the Oberon compiler, they were able to build Java:
"Java is Oberon with a C syntax". But Java never reached the original (Oberon)!

To see what Oberon can do, try the BlackBox framework (its language is called Component Pascal): that's Oberon!

One important thing I remember: Niklaus Wirth said in one of his videos, "You will get exactly the character position where a syntax error is".
This is a big difference from a simple message like "Syntax error in line xy". That's exactly what you need for code converters.
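
To illustrate the point about exact error positions, here is a minimal sketch in Python. It is not Oberon's scanner; the token set, the `Token` and `LexError` names, and the toy syntax are all assumptions made for this example. The idea is only that every token remembers its line and column, so an error can point at the offending character.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    text: str
    line: int    # 1-based line number
    column: int  # 1-based column number

class LexError(Exception):
    """Carries the exact position of the offending character."""
    def __init__(self, message, line, column):
        super().__init__(f"{message} at line {line}, column {column}")
        self.line = line
        self.column = column

# Toy token set (an assumption for this sketch).
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=()])
  | (?P<NEWLINE>\n)
  | (?P<SKIP>[ \t]+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    pos, line, line_start = 0, 1, 0
    while pos < len(source):
        column = pos - line_start + 1
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise LexError(f"Unexpected character {source[pos]!r}", line, column)
        kind = m.lastgroup
        if kind == "NEWLINE":
            line += 1
            line_start = m.end()   # columns restart after the line break
        elif kind != "SKIP":
            tokens.append(Token(kind, m.group(), line, column))
        pos = m.end()
    return tokens
```

Calling `tokenize("x = 1\ny = 2 $ 3")` raises a `LexError` whose `line` and `column` attributes point at the `$` character. A converter can reuse the same positions later to map generated code back to the source.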
SMaag
Enthusiast
Posts: 316
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Post by SMaag »

Here is the link to the Project Oberon book by Niklaus Wirth and Jürg Gutknecht of ETH Zürich.

https://people.inf.ethz.ch/wirth/ProjectOberon1992.pdf

The compiler description, with code, starts at page 266 (12. The Compiler).
AZJIO
Addict
Posts: 2152
Joined: Sun May 14, 2017 1:48 am

Post by AZJIO »

Basically you need to write a compiler, since you don't just parse the syntax, you also need to split it into code sections. I think it's better to use character by character reading for the syntax parser, so you can easily modify the parser behavior during the reading process.
You need to do what Fred does when he converts PureBasic code to ASM or C-Backend.
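
A minimal sketch of character-by-character reading, in Python. It is not how the PureBasic compiler works internally; the function name `split_sections` and the toy rules (`;` starts a line comment, double quotes delimit strings, no escape sequences) are assumptions for this example. The scanner keeps a state and changes its behavior while reading, which is what makes it easy to split the source into code, string, and comment sections.

```python
def split_sections(source):
    """Return a list of (kind, text); kind is 'code', 'string' or 'comment'."""
    sections = []
    state = "code"
    buf = []

    def flush(kind):
        # Store the collected characters as one section of the given kind.
        if buf:
            sections.append((kind, "".join(buf)))
            buf.clear()

    for ch in source:
        if state == "code":
            if ch == '"':
                flush("code")
                buf.append(ch)
                state = "string"      # from now on ';' is plain text
            elif ch == ";":
                flush("code")
                buf.append(ch)
                state = "comment"     # from now on '"' is plain text
            else:
                buf.append(ch)
        elif state == "string":
            buf.append(ch)
            if ch == '"':
                flush("string")
                state = "code"
        else:  # state == "comment"
            if ch == "\n":
                flush("comment")
                buf.append(ch)
                state = "code"
            else:
                buf.append(ch)
    flush(state)  # whatever is left, including an unterminated string
    return sections
```

For the input `a = "x;y" ; note` the `;` inside the string stays part of the string section, while the second `;` starts a comment section. A plain regex search for `;` would not make that distinction without extra work.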
idle
Always Here
Posts: 5855
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Post by idle »

Knowing which languages are involved would be a good start.
There are numerous frameworks that can make it easier; check out the Frameworks section in this repo:
https://github.com/milahu/awesome-transpilers
SMaag
Enthusiast
Posts: 316
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Post by SMaag »

AZJIO wrote: Basically you need to write a compiler, since you don't just parse the syntax, you also need to split it into code sections. I think it's better to use character by character reading for the syntax parser, so you can easily modify the parser behavior during the reading process.
You need to do what Fred does when he converts PureBasic code to ASM or C-Backend.

Exactly!

But remember: it is not possible to convert every language into every other language. A good example is C++ classes. You can't convert them to PB or Pascal, because PB and Pascal do not support that kind of structure. And if you can convert them using custom code structures, it is a one-way ticket: you can't convert it back. It is like compiling PB to ASM: you lose the original structure. It took decades to develop decompilers from machine code to C, like the one in Ghidra (Ghidra itself is largely written in Java). https://github.com/NationalSecurityAgency/ghidra

So the first step would be to research the language structures and find out what is easy to convert.
For example, VB6 -> Pascal or Delphi is the easy direction, but Pascal -> VB6 is not possible in many cases. PureBasic -> Pascal is possible, but Pascal -> PureBasic is not!

So before thinking about the parser and converter, research the language structures.
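
Such a structure survey can be written down as data before any parser exists. The sketch below is in Python; the language names `LangA` and `LangB` and their feature sets are made-up placeholders, not real measurements of any language. It only shows why a conversion can work in one direction and fail in the other.

```python
# Placeholder feature sets: names and contents are invented for illustration.
FEATURES = {
    "LangA": {"procedures", "records", "pointers"},
    "LangB": {"procedures", "records", "pointers", "classes", "templates"},
}

def missing_features(source, target):
    """Features of `source` that `target` has no direct equivalent for."""
    return sorted(FEATURES[source] - FEATURES[target])

def directly_convertible(source, target):
    """True if every feature of `source` also exists in `target`."""
    return not missing_features(source, target)
```

With these placeholder sets, `directly_convertible("LangA", "LangB")` is true, while `missing_features("LangB", "LangA")` lists `classes` and `templates`. Those are the constructs a converter would have to emulate or reject.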

To this day, the consensus is that a "Universal Intermediate Language" does not exist and cannot exist!
Piero
Addict
Posts: 884
Joined: Sat Apr 29, 2023 6:04 pm
Location: Italy

Post by Piero »

SMaag wrote: Thu Aug 14, 2025 9:54 am But remember: it is not possible to convert every language into every other language.
[sarc]Huh? There's AI for that![/sarc]

Edit/PS: and quantum computers :mrgreen:
SMaag
Enthusiast
Posts: 316
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Post by SMaag »

Piero wrote: Huh? There's AI for that!
Try to convert VB6 or Fortran to PureBasic with AI!
This will change your mind!
Piero
Addict
Posts: 884
Joined: Sat Apr 29, 2023 6:04 pm
Location: Italy

Post by Piero »

SMaag wrote: Thu Aug 14, 2025 12:57 pm Try to convert VB6 or Fortran to PureBasic with AI!
This will change your mind!
I wouldn't be so sure; will it censor at least 'pineapple pizza' if I ask it in Italian?
Skipper
User
Posts: 41
Joined: Thu Dec 19, 2024 1:26 pm
Location: NW-Europe

Post by Skipper »

- first, you write a tokenizer that takes the source file and creates an orderly token stream array from it.
- then, you write a parser that takes the token stream array and converts it into an AST
- with this AST, you do semantic analysis, detect errors the parser could not catch (for example type or scope errors), do optimisations, etc. The output is a revised/corrected AST.
- then you write a target language code emitter, that takes the corrected AST from the previous step and emits source code for your target language.
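
The four steps above can be sketched end to end for a toy language. This is a minimal illustration in Python, assuming a made-up source syntax (`name = expression` per line, with `+`, `*`, and parentheses) and a Pascal-like target. All function names and the tuple-based AST layout are assumptions for this example, and the optimisation step is left out.

```python
import re

# Step 1: tokenizer, one source line -> list of (kind, text) tuples.
TOKEN_RE = re.compile(r"(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<OP>[=+*()])")

def tokenize(line):
    tokens, pos = [], 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at column {pos + 1}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# Step 2: recursive descent parser, token list -> AST made of nested tuples.
class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("EOF", "")

    def take(self, kind, text=None):
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1] or 'end of line'}")
        self.i += 1
        return tok[1]

    def statement(self):              # statement := ID '=' expr
        name = self.take("ID")
        self.take("OP", "=")
        node = ("assign", name, self.expr())
        self.take("EOF")
        return node

    def expr(self):                   # expr := term ('+' term)*
        node = self.term()
        while self.peek() == ("OP", "+"):
            self.i += 1
            node = ("bin", "+", node, self.term())
        return node

    def term(self):                   # term := atom ('*' atom)*
        node = self.atom()
        while self.peek() == ("OP", "*"):
            self.i += 1
            node = ("bin", "*", node, self.atom())
        return node

    def atom(self):                   # atom := NUM | ID | '(' expr ')'
        kind, text = self.peek()
        if kind in ("NUM", "ID"):
            self.i += 1
            return ("num", int(text)) if kind == "NUM" else ("var", text)
        self.take("OP", "(")
        node = self.expr()
        self.take("OP", ")")
        return node

# Step 3: semantic analysis, here only "variable used before assignment".
def check(program):
    defined = set()
    def visit(node):
        if node[0] == "var" and node[1] not in defined:
            raise NameError(f"variable {node[1]!r} used before assignment")
        if node[0] == "bin":
            visit(node[2])
            visit(node[3])
    for stmt in program:
        visit(stmt[2])
        defined.add(stmt[1])
    return program

# Step 4: emitter, AST -> Pascal-like target text.
def emit(node):
    kind = node[0]
    if kind == "num":
        return str(node[1])
    if kind == "var":
        return node[1]
    if kind == "bin":
        return f"({emit(node[2])} {node[1]} {emit(node[3])})"
    if kind == "assign":
        return f"{node[1]} := {emit(node[2])};"
    raise ValueError(f"unknown node kind: {kind}")

def convert(source):
    program = [Parser(tokenize(l)).statement() for l in source.splitlines() if l.strip()]
    return "\n".join(emit(stmt) for stmt in check(program))
```

`convert("a = 2\nb = a + 3 * a")` produces two Pascal-like assignment lines, with the precedence of `*` over `+` made explicit by parentheses. Because each stage only sees the output of the previous one, a second target language would need a new `emit` function and nothing else.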

Quite an undertaking - good luck with your project...

Skipper
skinkairewalker
Enthusiast
Posts: 778
Joined: Fri Dec 04, 2015 9:26 pm

Post by skinkairewalker »

awesome guys :D

I am very grateful to everyone for sharing their experiences. I will use them all, thank you very much!!
#NULL
Addict
Posts: 1498
Joined: Thu Aug 30, 2007 11:54 pm
Location: right here

Post by #NULL »

You might want to look into this as a starting point for a tokenizer:
Lexer for PB 4: viewtopic.php?t=22116
It's specifically for PB code, so not general.
In the German forum thread mentioned there, there is a more up-to-date version.
Sicro
Enthusiast
Posts: 560
Joined: Wed Jun 25, 2014 5:25 pm
Location: Germany

Post by Sicro »

For my parsing tasks, I always use a lexer/parser combination. For the lexer, I use my own RegEx engine (see my post signature below, look also in the example directory on the project page), which is very flexible, does not require backtracking, and generates a very fast DFA. For the parser, I always write a recursive descent parser (see my post signature below, look in the parser directory on the CodeArchiv project page), because this allows it to react flexibly to different situations.

If you also perform lexing in the parser (scannerless parser), the parser can become quite complex. With a separate lexer, some of the complexity can be outsourced from the parser, which can also reduce the amount of backtracking required by the parser, making it faster. For context-sensitive parsing, several separate lexers can be used, between which the parser can switch during processing.
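
A minimal sketch of switching between separate lexers, in Python. This is not the RegEx engine or the CodeArchiv parser mentioned above; the two modes (`code` and `string`), the `{name}` interpolation syntax, and the function names are assumptions for this example. The parser keeps a stack of modes and asks the lexer that belongs to the current mode for the next token.

```python
import re

# One regex-based lexer per mode (toy token sets, assumed for this sketch).
LEXERS = {
    "code": re.compile(r'(?P<WS>\s+)|(?P<IDENT>[A-Za-z_]\w*)|(?P<QUOTE>")|(?P<RBRACE>\})'),
    "string": re.compile(r'(?P<TEXT>[^"{]+)|(?P<LBRACE>\{)|(?P<QUOTE>")'),
}

def next_token(source, pos, mode):
    m = LEXERS[mode].match(source, pos)
    if m is None:
        raise SyntaxError(f"unexpected input at offset {pos} in {mode} mode")
    return m.lastgroup, m.group(), m.end()

def parse(source):
    """Return a flat list of (kind, text); the parser picks the lexer per step."""
    out, pos = [], 0
    modes = ["code"]
    while pos < len(source):
        kind, text, pos = next_token(source, pos, modes[-1])
        if kind == "WS":
            continue
        out.append((kind, text))
        if kind == "QUOTE":
            # A quote opens a string in code mode and closes it in string mode.
            if modes[-1] == "code":
                modes.append("string")
            else:
                modes.pop()
        elif kind == "LBRACE":
            modes.append("code")      # interpolation: back to the code lexer
        elif kind == "RBRACE":
            if len(modes) < 3:
                raise SyntaxError("'}' outside of an interpolation")
            modes.pop()
    if len(modes) != 1:
        raise SyntaxError("unterminated string or interpolation")
    return out
```

For `print "Hi {name}!"` the word `print` and the word `name` come out as `IDENT` tokens, while `Hi ` and `!` come out as `TEXT`. The same characters are tokenized differently depending on which lexer the parser has selected, which is the context sensitivity described above.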

If you want to translate from any programming language to any other programming language, I think it would be a good idea to come up with a unified intermediate language or abstract syntax tree and then translate each programming language into this form first. This way, in the next step, you can always translate from the unified form into the target programming language without having to keep the source programming language in mind.
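
A minimal sketch of the unified-form idea, in Python. The node types `Declare` and `Assign`, the neutral type names, and the two backends are hypothetical and only for illustration; expressions are kept as opaque text for brevity, which a real converter could not do because operators differ between languages. The emitted lines are fragments, not complete programs.

```python
from dataclasses import dataclass

# Hypothetical language-neutral nodes (an assumption for this sketch).
@dataclass
class Declare:
    name: str
    type: str    # neutral type name: "int" or "float"
    value: str   # literal text of the initial value

@dataclass
class Assign:
    name: str
    expr: str    # expression text, kept opaque here for brevity

class CBackend:
    TYPES = {"int": "int", "float": "double"}

    def emit(self, node):
        if isinstance(node, Declare):
            return f"{self.TYPES[node.type]} {node.name} = {node.value};"
        if isinstance(node, Assign):
            return f"{node.name} = {node.expr};"
        raise TypeError(f"unsupported node: {node!r}")

class PascalBackend:
    TYPES = {"int": "Integer", "float": "Double"}

    def emit(self, node):
        if isinstance(node, Declare):
            return f"var {node.name}: {self.TYPES[node.type]} = {node.value};"
        if isinstance(node, Assign):
            return f"{node.name} := {node.expr};"
        raise TypeError(f"unsupported node: {node!r}")

BACKENDS = {"c": CBackend(), "pascal": PascalBackend()}

def translate(program, target):
    """Emit the same neutral program for the chosen target language."""
    return "\n".join(BACKENDS[target].emit(node) for node in program)
```

The same list `[Declare("n", "int", "5"), Assign("n", "n + 1")]` can be passed to `translate` with `"c"` or `"pascal"`. With N source languages and M targets, this layout needs N front ends and M backends instead of N times M direct translators, provided the neutral form can express what the sources need.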

You should also ask yourself whether the translated code should still be easy for the programmer to read and understand.

It is also important to bear in mind that the translated code will probably not take advantage of all the benefits of the target programming language; most of the time it will probably use only the basic features.
Why OpenSource should have a license :: PB-CodeArchiv-Rebirth :: Pleasant-Dark (syntax color scheme) :: RegEx-Engine (compiles RegExes to NFA/DFA)
Manjaro Xfce x64 (Main system) :: Windows 10 Home (VirtualBox) :: Newest PureBasic version