>>> We will use UTF16 as native encoding.
>>
>> But we can store the data as UTF-8, right (to keep the size of our
>> db's down)? It will be converted to and from UTF-16 by Valentina for
>> calculations/sorts/etc. on-the-fly, correct?
>
> I still not sure, Jon.
>
> Once I have see some problem with UTF8 storage.
> Do not remember now which one.
>
> When I come to String-based indexes, we will think again on this.
>
> Yes, dream is to store as UTF-8 or some other single byte encoding,
> IF developer have told this.
If I may add my grain of salt on this, I will go further: let the user of the database (us) choose. It could be for the whole database, i.e. not necessarily at field level. But as always the more flexibility the better...
Here's why:
UTF-8 takes 1 to 4 bytes per character
UTF-16 takes 2 or 4 bytes per character
UTF-32 takes 4 bytes per character
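To make those ranges concrete, here is a small Python sketch that encodes a few sample characters and counts the bytes. The helper name `encoded_sizes` and the sample characters are mine, chosen only for illustration; the little-endian codecs are used so that no byte-order mark is counted.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the number of bytes one character needs in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        # The "-le" variants do not prepend a byte-order mark.
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # ASCII letter, accented Latin letter, euro sign, musical G clef (outside the BMP)
    for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the last sample needs 4 bytes in UTF-8 and UTF-16; it lies outside the first 65,536 code points.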
Now imagine that you have a field where you allow the user to enter up to 64 characters. Quiz: how much space do you have to reserve for that field in the database under each of the 3 encodings?
Encoding   For Americans   For International
--------   -------------   -----------------
UTF-8           64               256
UTF-16         128               256
UTF-32         256               256

(all sizes in bytes)
So, as we can see, as soon as we develop software for an international audience, we need to reserve 4 bytes per character anyway. The only difference is where the blanks (the unused bytes) end up: at the end of the string or in the middle of it.
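The numbers in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `field_bytes` and the two lookup tables are my own, and "For Americans" is taken to mean ASCII-only text.

```python
# Bytes per character if the field only ever holds ASCII text.
ASCII_BYTES_PER_CHAR = {"UTF-8": 1, "UTF-16": 2, "UTF-32": 4}

# Worst-case bytes per character if any Unicode character is allowed.
MAX_BYTES_PER_CHAR = {"UTF-8": 4, "UTF-16": 4, "UTF-32": 4}


def field_bytes(max_chars: int, encoding: str, ascii_only: bool) -> int:
    """Bytes to reserve so that max_chars characters always fit."""
    table = ASCII_BYTES_PER_CHAR if ascii_only else MAX_BYTES_PER_CHAR
    return max_chars * table[encoding]


if __name__ == "__main__":
    for enc in ("UTF-8", "UTF-16", "UTF-32"):
        print(enc, field_bytes(64, enc, True), field_bytes(64, enc, False))
```

For a 64-character field this gives 64, 128 and 256 bytes in the ASCII-only case, and 256 bytes for every encoding in the international case.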
However, UTF-32 will be faster for all comparisons, since no characters have to be decoded first. Comparing a 4-byte UTF-8 character costs more than comparing a 4-byte UTF-16 character, which in turn costs more than comparing a UTF-32 character. Unfortunately, most Unicode implementations support only UTF-8 and UTF-16.
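One concrete example of extra comparison work, which may or may not be the cost meant above: comparing raw UTF-16 code units does not always give code point order, so a UTF-16 comparison has to treat surrogate pairs specially. The Python sketch below compares the encoded bytes directly; the helper name `raw_less` and the two sample characters are my own choice. Big-endian codecs are used so that byte order matches code unit order.

```python
def raw_less(a: str, b: str, encoding: str) -> bool:
    """Compare two strings by their encoded bytes, without decoding."""
    return a.encode(encoding) < b.encode(encoding)


a = "\uFF5E"      # U+FF5E, inside the first 65,536 code points
b = "\U00010000"  # U+10000, needs a surrogate pair in UTF-16

if __name__ == "__main__":
    # Python compares str objects by code point, so this is the reference order.
    print("code points:", a < b)
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(enc, raw_less(a, b, enc))
```

UTF-8 and UTF-32 agree with the code point order here, while the raw UTF-16 comparison puts the two characters the other way round.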
So, for today, the ideal is to use UTF-8 for Americans (or French, etc.) and UTF-16 everywhere else. Once implementations add support for UTF-32, it will become worthwhile to switch to it. UTF-32 is the only encoding that is native in the sense of one fixed-size unit per character. UTF-16 used to be that too, but this changed when Unicode grew past 16 bits...
Since it is a new code base, why not build in that flexibility from the start...
Eric
___________________________________________________________________
Eric Forget Cafederic ForgetE .at. cafederic .DOT com <http://www.cafederic.com/>
Fingerprint <86D5 38F5 E1FD 5D9C 71C3 BAA3 797E 70A4 6210 C684>
_______________________________________________ Valentina mailing list Valentina@lists.macserve.net http://lists.macserve.net/mailman/listinfo/valentina
©2004 Eric Forget