The journey continues: PostgreSQL and Unicode in VAST 2024

I wasn’t planning on writing a second installment on this little journey this soon. But instead of going back to my normal work I wanted to go the extra mile and see if and how I can add a UnicodeString to the Database from VAST.

Of course I found myself struggling with writing a UncodeString. Well, actually I seem to be struggling in creating a UnicodeString with UTF-8 in a workspace. I seem to be overseeing the obvious, but somehow whatever i try to create and write is ending up as ugly garbage in the database….

So let’s start with my workspace (using VAST 13 on Windows 10 x64):

| conn logonSpec conSpec newRow table |

conSpec := AbtDatabaseConnectionSpec
forDbmClass: #AbtPostgreSQLDatabaseManager
databaseName: 'testdb'.


logonSpec := AbtDatabaseLogonSpec id: 'postgres' password: 'pw' server: 'server-ip:5432'.
conn := conSpec connectUsingAlias: 'TestConSpec'
logonSpec: logonSpec.
conn enableUnicode.


table := conn openTableNamed: 'text'.
newRow := table emptyRow.
newRow at: #id put: 5.
newRow at: #text put: ('Zürich ist übrigens nicht häßlich sondern sehr schön' asUnicodeString ).
table addRow: newRow.

conn disconnect.


| conn logonSpec conSpec qSpec result resultCollection newRow table |
resultCollection := OrderedCollection new.

I tried a few variants of creating an appropriate String for insertion, but all I end up seeing in psql on the server is this (please not that row #1 was inserted on that linux box using psql, the others were inserted trying different ways of creating a UnicodeString).

The problem is: I am not even sure I ever created a valid utf-8 String, because they also show up broken in a Workspace. This all feels like trying to build a card house on a layer of marbles lying on the trunk of a car at 120 mph. What is right, what is wrong?

Inspecting

‘Zürich ist übrigens nicht häßlich sondern sehr schön’  asUnicodeString

yields

‘Zürich ist übrigens nicht häßlich sondern sehr schön’

So I guess the problem is not the PostgreSQL driver. The current codepage in my image is 1252. Inspecting the parameter to UnicodeString class>>utf8: looks like this:

‘Zürich ist übrigens nicht häßlich sondern sehr schön’

and I guess this is the main problem here. The conversion from codepage 1252 to utf-8 before creating a UnicodeString fails already (I guess…).

Then, there is this interesting question: the Workspace is configured to be UTF-8 (the menu “Encoding” has a checkmark at the “UTF-8′ entry, the other 3 options are unchecked.

So is my code wrong or is my Workspace configured incorrectly?

… ah, yes, that’s it: I need to change the Workspace setting to be ANSI, which converts the String to a broken representation, so I have to correct the String back to contain Umlauts. Inspecting the result now works….

So now I try to insert ths String to the Database.

Et voila:

This of course brings up an interesting bundle of question I need to solve.
First, I need to find a combination of Workspace/Inspector setup that allows for handling Strings of different origin: the imagr or tha database (and later: Seaside or API hanlders).

Another question is: what about text constants in my source code that are now in the manager library. We do a lot of nextPutAll: with German Strings that contain Umlauts. Since my images on Windows and Linux always were configured to use 1252 or iso-8859, these Strings will of course be in that encoding in the manager library.

I don’t want to sound too pessimistic, but I am afraid I am up to a lot of even more fun with Umlauts in the months to come (Marten, Michal, Peter: I hear you laughing out loud!)…

6 responses to “The journey continues: PostgreSQL and Unicode in VAST 2024”

  1. Henrik Johansen Avatar
    Henrik Johansen

    If you want something that works in both modes, and ensures the argument is a UnicodeString, you can use
    newRow at: #text put: (UnicodeString escaped: ‘Zu{FC}rich ist u{FC}brigens nicht hu{E4}u{DF}lich sondern sehr schu{F6}n’)

    In the next release, the workspace will work as you expect, and
    newRow at: #text put: ‘Zürich ist übrigens nicht häßlich sondern sehr schön’
    will be enough.
    If you inspect/debug the expression, the literal will be a String if the ACP contains its code points, and a UnicodeString if not.
    Either way, the database layer will treat both correctly.

    “Another question is: what about text constants in my source code that are now in the manager library. We do a lot of nextPutAll: with German Strings that contain Umlauts. Since my images on Windows and Linux always were configured to use 1252 or iso-8859, these Strings will of course be in that encoding in the manager library.”
    If things go according to plan, they will still be that way – we intend to use a flag on each method in the manager to indicate whether the source is stored as ANSI or UTF8. A checkbox in the method editor will reflect this, and let you enable Unicode mode – existing source will then get converted.
    The other usual restrictions still apply, so this means in Unicode mode, you can have Unicode string literals, but not selectors/inst vars/class names.

    Like

    1. Joachim Avatar

      Henrik,

      this all sounds wonderful. So I guess I just have to wait for all my problems and struggles to go away 😉
      I like the idea of having control of the encoding of source code.
      I think we don’t really need unicode for syntactic elements like selectors, though. Having method selectors consisting of emojis may sound cool, but it’s probably not what the world has been looking for (grey beards like me even have no idea how to type them on a PC keyboard …). So ASCII only selectors are fine.
      I was about to suggest a one-time conversion of a manager library to utf-8, until I remembered that people may be accessing a library with multiple VAST versions in parallel (or even just for archeological reasons like restoring an older version of a program for some legal requirements)…
      It’s good to know you guys have backwards compatibility in mind.

      Like

  2. emaringolo Avatar

    Joachim:
    If your Workspace is set to UTF-8 you can do:
    newRow at: #text put: (‘Zürich ist übrigens nicht häßlich sondern sehr schön’ utf8AsUnicodeString ). “This will treat the bytes in the workspace as UTF-8 bytes”

    Or if your Workspace is set to ANSI then you can do:
    newRow at: #text put: (‘Zürich ist übrigens nicht häßlich sondern sehr schön’ asUnicodeString ). “This will go through the active codepage converter”

    Like

    1. Joachim Avatar

      Hi Esteban,

      thanks for your (extremely prompt) answer!
      So there will be some getting used to how to configure all that.

      One question remains: what about String literals in source code? What are they encoded in? Will this change over time? Meaning will there be Strings in the Library that will be encoded diferently in future releases? If so, I guess you’ll be providing conversion tools? Because otherwise, things will get eevn more complicated…

      I am stopping here for today. So far I like what I see. I hope to find the time to continue with more interesting things soon. Seaside (web and API clients) and the Library Manager are my two other areas of doubt (not sure how to put that in english) at the moment. I guess I’ll have to do an explicit conversion from ISO-8859-15 to UTF-8 for everything in the source code…?

      Like

      1. Henrik Johansen Avatar
        Henrik Johansen

        A conversion of the entire Library to UTF-8 is something we will consider if the demand is great – the danger is that, if you try to load the code in an image operating with a different ACP – the literal class may be UnicodeString rather than String.
        As we cannot ensure full polymorphism between the two classes, it seems less disruptive to not do so by default, instead letting the user convert/test each affected method at their own discretion.

        Like

        1. Joachim Avatar

          Henrik,

          as I wrote in my other comment, a one-time conversion may not even be a good idea. Chances are people are working on different projects with different VAST versions on the same Library and you might lock older versions out if you convert….

          Like