Kanji Dictionary in SQLite

pdwyer · Post by **pdwyer** » Mon Oct 01, 2007 5:23 am

True!

Actually, I had an idea for an easy way to convert the DB!

If you do a kanji lookup for "%" it dumps the whole DB to the listbox (108000+ rows). I could compile another copy of the app with a dump feature, then write the listbox rows out to CSV, converting them via MultiByteToWideChar. Then I could just import the CSV back into SQLite! (I have a tool for that already.)

That could take some of the annoyance (effort) out of converting.
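A minimal sketch of that dump-and-reimport idea, written in Python rather than PureBasic so it runs as-is. The tab-separated row layout, the sample rows, and the table and column names are assumptions for illustration; the thread does not show the real dump format. Decoding from `cp932` plays the role that MultiByteToWideChar with code page 932 would play in the app.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for rows dumped from the listbox: SJIS (cp932) bytes,
# tab-separated as kanji / reading / meaning.
sjis_rows = [
    "核\tかく\tnucleus; kernel".encode("cp932"),
    "漢字\tかんじ\tChinese characters".encode("cp932"),
]

# Step 1: decode SJIS to Unicode text and write it out as CSV.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
for raw in sjis_rows:
    writer.writerow(raw.decode("cp932").split("\t"))

# Step 2: read the CSV back and bulk-insert it into a fresh SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kanji (kanji TEXT, reading TEXT, meaning TEXT)")
csv_buffer.seek(0)
conn.executemany("INSERT INTO kanji VALUES (?, ?, ?)", csv.reader(csv_buffer))
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM kanji").fetchone()[0]
first_reading = conn.execute(
    "SELECT reading FROM kanji WHERE kanji = ?", ("核",)
).fetchone()[0]
```

Going through the `csv` module rather than joining fields by hand keeps meanings that contain commas or quotes intact.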

I'd have to rewrite this proc, though, after looking up the unicode boundaries, otherwise the correct search won't run...

Code:

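Procedure.l GetTextType(SearchText.s) 

    ; Classify the search text by the first byte of the SJIS string
    FirstChar.l = Asc(Left(SearchText,1)) 
    
    If FirstChar < 129          ; below the first SJIS lead byte ($81): treat as English
        ProcedureReturn #English 
    ElseIf FirstChar > 135      ; $88 and above: treated as kanji
        ProcedureReturn #Kanji 
    Else                        ; lead bytes $81 to $87: symbols and kana
        ProcedureReturn #Hiragana 
    EndIf    

EndProcedure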

mskuma · Post by **mskuma** » Mon Oct 01, 2007 8:41 am

The easiest way to convert the original edict file is to bring it into a text editor, export it as unicode, and then use ts-soft's routines (or the ODBC driver) to create, populate & use the database.

Yes, you can use the unicode char boundaries to easily determine the type of character (i.e. ascii vs half- or full- width kana or kanji).

pdwyer · Post by **pdwyer** » Mon Oct 01, 2007 11:45 pm

The original is a pain to work with since it's EUC and the column boundaries are different in the first part of the file, where there are only katakana words etc. Since I have a clean sjis DB (for me anyway) it's easier to dump from there.

Don't suppose you know off the top of your head where hiragana and katakana end and kanji starts in UTF 16? Save me looking it up

mskuma · Post by **mskuma** » Tue Oct 02, 2007 12:14 am

pdwyer wrote: Don't suppose you know off the top of your head where hiragana and katakana end and kanji starts in UTF 16? Save me looking it up

You're lazy

Hiragana/katakana: 0x3041 to 0x30FE
Kanji: 0x4E00 to 0xFA2D
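A minimal sketch of how GetTextType could look once the text is Unicode, written in Python rather than PureBasic so it runs as-is. The range limits are the ones quoted in this post; the function name, the string return values, and the choice to treat everything outside both ranges as English are assumptions carried over from the shape of the original proc.

```python
# Range limits as quoted in the post (UTF-16 code units for BMP characters).
KANA_START, KANA_END = 0x3041, 0x30FE
KANJI_START, KANJI_END = 0x4E00, 0xFA2D


def get_text_type(search_text: str) -> str:
    """Classify a search string by the code point of its first character."""
    if not search_text:
        return "english"
    code = ord(search_text[0])
    if KANA_START <= code <= KANA_END:
        return "kana"
    if KANJI_START <= code <= KANJI_END:
        return "kanji"
    # Anything else (ASCII, punctuation, other scripts) falls through here.
    return "english"
```

Unlike the SJIS version, this compares a whole character code rather than a lead byte, so the same test works for hiragana and katakana alike.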

pdwyer · Post by **pdwyer** » Wed Oct 03, 2007 1:36 pm

No, just have a cold

Okay, the UTF-8 DB is done and is here
http://www.dwyer-family.net/download/utf8-Kanji.zip

I tried the app with it in unicode but it doesn't work. The error is on the SQL statement, which I really don't get, as a PB unicode compile is supposed to be UTF8 and SQLite SQL is supposed to be UTF8. Even if the lookup is on an English word the error is the same. Perhaps it's compiled as UTF16... I'm a bit too tired to think about it now. I thought I'd see if your mind was fresher and you could try it.

select kanji, reading, meaning from kanji where kanji like '核%' order by kanji
SQLError:near "s": syntax error

pdwyer · Post by **pdwyer** » Wed Oct 03, 2007 1:41 pm

Hang on, I re-read the docs. UTF8 is used by PB for text files in unicode mode, so I guess it's UTF16 for APIs and internal strings, and so I'm sending a UTF16 string to the SQLite engine.

let me have a play with it
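A small sketch of why a UTF-16 string could produce exactly that `near "s"` error, written in Python rather than PureBasic so it runs as-is. It assumes the SQL was handed to one of SQLite's UTF-8 entry points, which read a NUL-terminated byte string; the variable names are illustrative only.

```python
sql = "select kanji, reading, meaning from kanji where kanji like '核%' order by kanji"

utf16_bytes = sql.encode("utf-16-le")  # byte layout of a Windows wide string
utf8_bytes = sql.encode("utf-8")       # what SQLite's non-16 functions expect

# A C function expecting a NUL-terminated byte string stops at the first
# zero byte, and in UTF-16LE every ASCII character is followed by one.
seen_as_c_string = utf16_bytes.split(b"\x00", 1)[0]

# The UTF-8 form has no embedded zero bytes, so the whole statement is read.
utf8_has_nul = b"\x00" in utf8_bytes
```

Here `seen_as_c_string` is just `b"s"`, which would explain why the parser reports a syntax error near "s" regardless of what the lookup word is.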

--- 10 mins later

Adding this line

Code:

WideCharToMultiByte_(#CP_UTF8,0,@SQL,-1,@UTF8SQL,Len(UTF8SQL),0,0)

And I get a new error,

"no such table: Kanji"

hmmmm, the sql looks better now, the error shows it's reading the words fully but... no dice.

Time to go to bed. The DB is okay I think, though: if I use it in non-unicode mode the selects work, they just don't display correctly.
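One possible cause of "no such table", not confirmed anywhere in this thread, is that the database file name went through the same encoding mix-up as the SQL, so SQLite opened a different path. SQLite treats a missing file as a new, empty database rather than raising an error. The sketch below is in Python rather than PureBasic so it runs as-is, and the mangled file name `u` is purely hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    wrong_path = os.path.join(tmp, "u")  # hypothetical mangled file name
    # No error here: a missing file is treated as a new, empty database.
    conn = sqlite3.connect(wrong_path)
    try:
        conn.execute("select kanji, reading, meaning from kanji")
        error_text = ""
    except sqlite3.OperationalError as exc:
        error_text = str(exc)
    conn.close()
```

If this is what happened, a stray new file would show up next to the app after a run, which is a quick thing to check.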

Hey, question: do you know which chcp code page displays UTF8 in a DOS window? With 932 I can display kanji in DOS but I've never tried it with unicode... any idea?

mskuma · Post by **mskuma** » Wed Oct 03, 2007 1:54 pm

pdwyer wrote:No, just have a cold

Yeah me too.

pdwyer wrote:PB unicode compile is supposed to be UTF8 and SQLite SQL is supposed to be UTF8

I think PB's unicode is UTF16 (at least that's the default string-handling encoding). SQLite can be either UTF8 or 16. I think you might have missed my point earlier.. have a close look at ts-soft's sqlite lib, and you'll see the brilliance that it is.. for dealing with unicode & non-unicode situations. I've also advocated that, since PB's unicode is basically favouring UTF16, it makes perfect sense to set up a sqlite db (or any db when in PB unicode) in UTF16 also. It's the path of least resistance (and least pain), though I guess you could play with UTF8 conversions, but it seems to me to be a lot of hassle. So in a nutshell, my suggestion is to make a UTF16 db (it's actually more efficient than UTF8 for the storage of Japanese chars anyway) and connect to it using sqlite's 16-series of functions, or use/research ts-soft's source. This is what I've been doing and it's going great for me.
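A minimal sketch of the UTF-16 database suggestion, written in Python rather than PureBasic so it runs as-is. The file name, table layout, and sample row are assumptions. Python's `sqlite3` module always talks to SQLite in UTF-8 and lets the engine convert, so this shows the storage setting and the per-character sizes, not the 16-series calls themselves.

```python
import os
import sqlite3
import tempfile

# Storage cost of one kanji in each encoding.
utf8_size = len("核".encode("utf-8"))
utf16_size = len("核".encode("utf-16-le"))

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "utf16-kanji.db"))  # hypothetical name
    # The text encoding has to be chosen before the first table is created.
    conn.execute('PRAGMA encoding = "UTF-16le"')
    conn.execute("CREATE TABLE kanji (kanji TEXT, reading TEXT, meaning TEXT)")
    conn.execute("INSERT INTO kanji VALUES (?, ?, ?)", ("核", "かく", "nucleus"))
    conn.commit()

    db_encoding = conn.execute("PRAGMA encoding").fetchone()[0]
    # Binding the search term as a parameter avoids building SQL text by hand.
    rows = conn.execute(
        "SELECT kanji, reading, meaning FROM kanji WHERE kanji LIKE ? ORDER BY kanji",
        ("核%",),
    ).fetchall()
    conn.close()
```

The saving applies to the kana and kanji columns (two bytes per character instead of three); ASCII text such as the English meanings takes two bytes per character in UTF-16 instead of one, so the overall file size depends on the mix.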

pdwyer · Post by **pdwyer** » Wed Oct 03, 2007 1:58 pm

we're crossing

pdwyer · Post by **pdwyer** » Wed Oct 03, 2007 2:01 pm

I couldn't build a quick and dirty utf16 db. I can build the flat file, but the freebie import tool I have needs ascii in the header line to import, so I went utf8.

Rook Zimbabwe · Post by **Rook Zimbabwe** » Thu Oct 04, 2007 3:36 am

pdwyer wrote: I couldn't build a quick and dirty utf16 db. I can build the flat file, but the freebie import tool I have needs ascii in the header line to import, so I went utf8.

Is the above in Kanji...

I mean it looks like English... I don't get it. What is UTF16 / UTF8??? 8 bit / 16 bit [something]???

pdwyer · Post by **pdwyer** » Thu Oct 04, 2007 5:53 am

That's part of the issue, you see. Some encodings like UTF16 are fully double byte (or wide), so even the English text is modified. UTF8 and sjis keep the base ASCII part the same (single byte) and then go multi-byte when the first byte is above 127 (to put it simply).
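A short sketch of that difference, in Python rather than PureBasic so it runs as-is. The sample strings are arbitrary; the codec names are Python's standard ones.

```python
ascii_text = "kanji"
kanji_text = "核"

encodings = ("utf-8", "shift_jis", "utf-16-le")

# Same text, three byte representations each.
ascii_bytes = {name: ascii_text.encode(name) for name in encodings}
kanji_bytes = {name: kanji_text.encode(name) for name in encodings}

# ASCII survives untouched in UTF-8 and Shift-JIS but is widened in UTF-16.
ascii_same_in_utf8 = ascii_bytes["utf-8"] == ascii_text.encode("ascii")
ascii_same_in_sjis = ascii_bytes["shift_jis"] == ascii_text.encode("ascii")
ascii_same_in_utf16 = ascii_bytes["utf-16-le"] == ascii_text.encode("ascii")

# For the kanji, the first byte in UTF-8 and Shift-JIS is above 127,
# which is the signal that more bytes follow.
lead_bytes = {name: kanji_bytes[name][0] for name in ("utf-8", "shift_jis")}
```

The kanji comes out as two bytes in Shift-JIS and UTF-16 and three in UTF-8, so "double byte" is only literally true for the first two.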

One of the challenges when dealing with these things is that there's a lot of conversion going on that you don't see. If you copy text out of a unicode app and paste it into a non-unicode app, it's translated. If you copy kanji out of a unicode app and paste it into an older-style Japanese app, it will change to Shift-JIS with completely different byte values, even though none of the programs involved wrote any logic to do this.

In translating these things, it always tends to look like you are working with the same data but you aren't. Input is one thing, the app type or compilation is another, the database engine is another and the data in the database is yet another (sometimes the OS is yet another). If these things don't match, then the data in the search from the user doesn't match the data in the database and nothing (correct) gets returned.
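A small sketch of that mismatch, in Python rather than PureBasic so it runs as-is. It assumes a table whose kanji column holds raw Shift-JIS bytes, the way an older non-unicode app would have written them; the table layout and the `lookup` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kanji (kanji BLOB, meaning TEXT)")
# Stored as raw Shift-JIS bytes.
conn.execute(
    "INSERT INTO kanji VALUES (?, ?)", ("核".encode("shift_jis"), "nucleus")
)


def lookup(term_bytes: bytes) -> list:
    """Byte-for-byte lookup, as the engine sees it."""
    return conn.execute(
        "SELECT meaning FROM kanji WHERE kanji = ?", (term_bytes,)
    ).fetchall()


# The same character typed by the user, in two different encodings.
utf8_hits = lookup("核".encode("utf-8"))
sjis_hits = lookup("核".encode("shift_jis"))
```

Only the Shift-JIS search finds the row. On screen both search terms would look identical, which is what makes this kind of bug hard to spot.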

Generally you try to keep it as simple as possible, choose a standard and stick with it from start to finish.

For me personally, I've dealt with unicode functions a bit, but I used to use PowerBASIC, which has no unicode compile support, so I've not really experienced that. On the other hand, I use Japanese OSes a lot, and sjis is still the more established standard even though unicode is getting bigger.

For this project I'm happy with sjis, but for the purpose of learning to use unicode with PB and sqlite I'm still playing with it. It'd be good to get all the bits working cleanly so I can refer to this code in the future when the need pops up. If possible I don't want to use any MultiByteToWideChar API stuff (which I suppose means UTF16, which means getting another import tool... problem is the import tool needs to be a working UTF16 sqlite PB app, which is what isn't working: catch 22).

pdwyer · Post by **pdwyer** » Thu Oct 04, 2007 10:15 am

mskuma wrote:have a close look at ts-soft's sqlite lib, and you'll see the brilliance that it is.. for dealing with unicode & non-unicode situations.

On this, while I'm sure it's a good lib (different to the open-source one?), I don't want to wrap sqlite. I especially don't want to wrap it in a way that makes me call that gettable command. Stubborn I might be, but that's gonna bite people one day if they need to do something big, and it was itself a wrapper created as a workaround.

The API for sqlite is not like using the API for ODBC3, there's not that much you need to do.
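A minimal sketch of the row-at-a-time style, in Python rather than PureBasic so it runs as-is. It assumes "gettable" above refers to SQLite's `sqlite3_get_table`, which hands back the whole result set in one array. Python's cursor iteration stands in here for a prepare/step loop; the table and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kanji (kanji TEXT, reading TEXT, meaning TEXT)")
conn.executemany(
    "INSERT INTO kanji VALUES (?, ?, ?)",
    [("核", "かく", "nucleus"), ("漢字", "かんじ", "Chinese characters")],
)

# Iterating the cursor hands the application one row at a time,
# instead of one array holding every row of the result.
seen = []
for kanji, reading, meaning in conn.execute(
    "SELECT kanji, reading, meaning FROM kanji"
):
    seen.append(kanji)  # stand-in for adding the row to the listbox
```

With a lookup for "%" returning 108000+ rows, processing rows as they arrive keeps the intermediate copy of the result out of memory.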