Page 2 of 2
Posted: Mon Oct 01, 2007 5:23 am
by pdwyer
True!
Actually, I had an idea for an easy way to convert the DB!
If you do a kanji lookup for "%" it dumps the whole DB to the listbox (108,000+ rows). I could compile another copy of the app with a dump feature, filter the listbox rows out to CSV via MultiByteToWideChar, then import the CSV back into SQLite (I already have a tool for that).
That could take some of the annoyance (effort) out of converting.
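For what it's worth, the dump-and-reimport step can be sketched outside PB too. A rough Python illustration (the filename and the three-column layout are my assumptions, not the app's actual schema): read a Shift-JIS CSV dump and load it straight into SQLite, letting Python do the re-encoding that MultiByteToWideChar handles in the PB version.

```python
import csv
import sqlite3

# Sketch only: "dump_sjis.csv" and the (kanji, reading, meaning) layout
# are assumptions for illustration, not the app's real schema.
def import_sjis_dump(csv_path, db_path=":memory:"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS kanji (kanji, reading, meaning)")
    # Python decodes Shift-JIS on read; sqlite3 stores the text as Unicode.
    with open(csv_path, encoding="shift_jis", newline="") as f:
        con.executemany("INSERT INTO kanji VALUES (?, ?, ?)", csv.reader(f))
    con.commit()
    return con
```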
I'd have to rewrite this proc though, after looking up the Unicode boundaries, otherwise the correct search won't run...
Code: Select all
Procedure.l GetTextType(SearchText.s)
  ; First-byte heuristic for Shift-JIS text: below 0x81 (129) is plain
  ; ASCII, 0x81-0x87 covers the kana/symbol rows, and 0x88 (136) and up
  ; is where the kanji rows start.
  FirstChar.l = Asc(Left(SearchText, 1))
  If FirstChar < 129
    ProcedureReturn #English
  ElseIf FirstChar > 135
    ProcedureReturn #Kanji
  Else
    ProcedureReturn #Hiragana
  EndIf
EndProcedure
Posted: Mon Oct 01, 2007 8:41 am
by mskuma
The easiest way to convert the original edict file is to bring it into a text editor, export it as Unicode, and then use ts-soft's routines (or the ODBC driver) to create, populate and use the database.
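The editor round-trip can also be scripted; a minimal sketch (both paths are placeholders, and the source is assumed to be valid EUC-JP), decoding EUC-JP and re-writing as UTF-16 in one pass:

```python
# Minimal sketch of the EUC -> Unicode conversion; both paths are
# placeholders for the real edict file locations.
def euc_to_utf16(src_path, dst_path):
    with open(src_path, encoding="euc_jp") as src:
        text = src.read()
    # "utf-16" writes a BOM, which most tools expect on a Unicode file
    with open(dst_path, "w", encoding="utf-16") as dst:
        dst.write(text)
```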
Yes, you can use the Unicode char boundaries to easily determine the type of character (i.e. ASCII vs half- or full-width kana, or kanji).
Posted: Mon Oct 01, 2007 11:45 pm
by pdwyer
The original is a pain to work with since it's EUC, and the column boundaries are different in the first part of the file where there are only katakana words etc. Since I have a clean SJIS DB (for me anyway) it's easier to dump from there.
Don't suppose you know off the top of your head where hiragana and katakana end and kanji starts in UTF-16? Save me looking it up

Posted: Tue Oct 02, 2007 12:14 am
by mskuma
pdwyer wrote:Don't suppose you know off the top of your head where hiragana and katakana end and kanji starts in UTF-16? Save me looking it up

You're lazy
Hiragana/katakana: 0x3041 to 0x30FE
Kanji: 0x4E00 to 0xFA2D
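Using those ranges, the GetTextType heuristic translates to code points like this (a Python sketch for illustration; the names and the fall-through cases are mine):

```python
# Character-type check on Unicode code points, using the ranges quoted
# above (0x3041-0x30FE kana, 0x4E00-0xFA2D kanji); range() excludes its
# end value, hence the +1.
KANA = range(0x3041, 0x30FE + 1)
KANJI = range(0x4E00, 0xFA2D + 1)

def text_type(search_text):
    first = ord(search_text[0])
    if first < 0x80:
        return "english"
    if first in KANJI:
        return "kanji"
    if first in KANA:
        return "kana"
    return "other"
```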
Posted: Wed Oct 03, 2007 1:36 pm
by pdwyer
No, just have a cold
Okay, the UTF-8 DB is done and is here
http://www.dwyer-family.net/download/utf8-Kanji.zip
I tried the app with it in unicode mode but it doesn't work. The error is on the SQL statement, which I really don't get, as PB unicode compile is supposed to be UTF8 and SQLite SQL is supposed to be UTF8. Even if the lookup is on an English word the error is the same. Perhaps it's compiled UTF-16... bit tired to think about it now. I thought I'd see if your mind was fresher and you could try it
select kanji, reading, meaning from kanji where kanji like '核%' order by kanji
SQLError:near "s": syntax error
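One plausible reading of that exact message (my guess; it isn't spelled out in the thread): if the query is UTF-16 but goes through SQLite's UTF-8 entry point, every ASCII character is followed by a zero byte, and a NUL-terminated reader stops after the first one. A quick byte-level check:

```python
# "select ..." encoded as UTF-16LE starts s, 0x00, e, 0x00, ... so a
# NUL-terminated UTF-8 API would see only the single token "s", which
# matches the reported 'near "s": syntax error'.
sql = "select kanji, reading, meaning from kanji"
raw = sql.encode("utf-16-le")
seen_by_utf8_api = raw.split(b"\x00", 1)[0]  # bytes before the first NUL
```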
Posted: Wed Oct 03, 2007 1:41 pm
by pdwyer
Hang on, I re-read the docs: UTF-8 is used by PB for text files in unicode mode, so I guess it's UTF-16 for APIs and internal strings, and I'm sending a UTF-16 string to the SQLite engine.
let me have a play with it
--- 10 mins later
Adding this line
Code: Select all
; Convert the UTF-16 SQL string to UTF-8 before handing it to SQLite
; (UTF8SQL must already be allocated large enough to hold the result)
WideCharToMultiByte_(#CP_UTF8, 0, @SQL, -1, @UTF8SQL, Len(UTF8SQL), 0, 0)
And I get a new error,
"no such table: Kanji"
Hmmmm, the SQL looks better now; the error shows it's reading the words fully, but... no dice.
Time to go to bed. The DB is okay I think, though: if I use it in non-unicode mode the selects work, they just don't display correctly.
Hey, question: do you know which chcp code page to use to display UTF-8 in DOS? With 932 I can display kanji in DOS, but I've never tried it with unicode... any idea?
Posted: Wed Oct 03, 2007 1:54 pm
by mskuma
pdwyer wrote:No, just have a cold

Yeah me too.
pdwyer wrote:PB unicode compile is supposed to be UTF8 and SQLite SQL is supposed to be UTF8
I think PB's unicode is UTF-16 (at least that's the default string-handling encoding). SQLite can be either UTF-8 or UTF-16. I think you might have missed my point earlier.. have a close look at ts-soft's SQLite lib, and you'll see the brilliance that it is.. for dealing with unicode & non-unicode situations.

I've also advocated that since PB's unicode basically favours UTF-16, it makes perfect sense to set up a SQLite db (or any db, when in PB unicode) in UTF-16 also. It's the path of least resistance (and least pain). I guess you could play with UTF-8 conversions, but it seems to me to be a lot of hassle.

So in a nutshell, my suggestion is: make a UTF-16 db (it's actually more efficient than UTF-8 for the storage of Japanese chars anyway) and connect to it using SQLite's 16-series of functions, or use/research ts-soft's source. This is what I've been doing and it's going great for me.
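For comparison outside PB: SQLite's on-disk text encoding is fixed with PRAGMA encoding before the first table is created. A small Python sketch of a UTF-16 database (the driver converts strings transparently, much like the sqlite3_*16() C functions do):

```python
import sqlite3

# The encoding pragma must run before any table exists; after that the
# database stores its text as UTF-16LE on disk.
con = sqlite3.connect(":memory:")
con.execute('PRAGMA encoding = "UTF-16le"')
con.execute("CREATE TABLE kanji (kanji, reading, meaning)")
con.execute("INSERT INTO kanji VALUES ('核', 'かく', 'nucleus')")
```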
Posted: Wed Oct 03, 2007 1:58 pm
by pdwyer
We're crossing posts

Posted: Wed Oct 03, 2007 2:01 pm
by pdwyer
I couldn't build a quick and dirty UTF-16 db. I can build the flat file, but the freebie import tool I have needs ASCII in the header line to import, so I went UTF-8
Posted: Thu Oct 04, 2007 3:36 am
by Rook Zimbabwe
pdwyer wrote:I couldn't build a quick and dirty UTF-16 db. I can build the flat file, but the freebie import tool I have needs ASCII in the header line to import, so I went UTF-8
Is the above in Kanji...

I mean it looks like English... I don't get it. What is UTF-16 / UTF-8??? 8-bit / 16-bit [something]???
Posted: Thu Oct 04, 2007 5:53 am
by pdwyer
That's part of the issue, you see. Some encodings, like UTF-16, are fully double-byte (or "wide"), so even the English text is modified. UTF-8 and SJIS keep the base ASCII part the same (single byte), then go multi-byte when the first byte is above 127 (to put it simply).
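To make that concrete, a quick comparison (Python, purely for illustration) of how many bytes one ASCII letter and one kanji take under each of the encodings being discussed:

```python
# Byte counts per encoding: ASCII stays 1 byte in UTF-8 and Shift-JIS
# but doubles in UTF-16, while a kanji is 2 bytes in UTF-16 and
# Shift-JIS but 3 in UTF-8.
def byte_lengths(text):
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "shift_jis")}
```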
One of the challenges when dealing with these things is that there's a lot of conversion going on that you don't see. If you copy text out of a unicode app and paste it into a non-unicode app, it's translated. If you copy kanji out of a unicode app and paste into an older-style Japanese app, it will change to Shift-JIS with completely different byte values, even though none of the programs involved wrote any logic to do this.
In translating these things, it always tends to look like you are working with the same data, but you aren't. Input is one thing, the app type or compilation is another, the database engine is another, and the data in the database is yet another (sometimes the OS is yet another). If these things don't match, then the data in the search from the user doesn't match the data in the database and nothing (correct) gets returned.
Generally you try to keep it as simple as possible, choose a standard and stick with it from start to finish.
For me personally, I've dealt with unicode functions a bit, but I used to use PowerBasic, which has no unicode compile support, so I've not really experienced that. On the other hand, I use Japanese OSes a lot, and SJIS is still the more established standard even though unicode is getting bigger.
For this project I'm happy with SJIS, but for learning purposes I'm still playing with unicode in PB with SQLite. It'd be good to get all the bits working cleanly so I can refer to this code in the future when the need pops up. If possible I don't want to use any MultiByteToWideChar API stuff (which I suppose means UTF-16, which means getting another import tool... problem is, the import tool needs to be a working UTF-16 SQLite PB app, which is exactly what isn't working. Catch-22.)
Posted: Thu Oct 04, 2007 10:15 am
by pdwyer
mskuma wrote:have a close look at ts-soft's sqlite lib, and you'll see the brilliance that it is.. for dealing with unicode & non-unicode situations.
On this, while I'm sure it's a good lib (different to the open-source one?), I don't want to wrap SQLite. I especially don't want to wrap it in a way that makes me call that GetTable command. Stubborn I might be, but that's gonna bite people one day if they need to do something big, and it was itself a wrapper created as a workaround.
The API for SQLite is not like using the API for ODBC3; there's not that much you need to do.