Decoding mysteries of encodings
2009-07-08, 17:25 by Jonas Wallden
By now I'm sure no one has missed that we've finally shipped Roxen CMS 5.0! This posting is not specifically about the new version, but I'll definitely return to that subject in future articles. Today I'll instead bring up a topic that can be a great source of confusion, namely character encodings. After reading this you'll hopefully have a clearer understanding of how Roxen works, and I'll share a couple of suggestions along the way that should help you avoid the most common pitfalls.
Roxen smartness to the rescue
Since the beginning we've built Roxen on a runtime environment that handles Unicode strings natively. This has helped us support mixed character sets with no additional effort when rendering graphics, performing XML/XSLT transformations and so on.
However, the real trouble begins when you want to communicate with the outside world – including any form of HTTP traffic (requests and responses) and file system access, to mention two areas. The HTTP standards initially didn't cover Unicode support, which led browser vendors to come up with various ad-hoc fixes to overcome the limitations. The situation has improved over time, but there are still old browsers in use, as well as compatibility issues and bugs, which complicate things.
For a long time Roxen has tried to isolate the necessary character handling in its HTTP protocol implementation. If all data sources that contribute to the outgoing web page deliver Unicode strings internally, the HTTP module will, at the last minute, dynamically decide on a suitable encoding. If the string can be represented in ISO-8859-1 (Latin 1) that has been the primary choice; otherwise the output has been encoded as UTF-8. Individual file system modules or other data sources can override the automatically inferred encoding by requesting a specific one. In all cases the outgoing data gets a suitable HTTP content-type header that announces the chosen encoding. As a web developer you could simply mix and match data and it would normally work just fine.
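As a rough sketch of that selection logic, here it is in Python. This is my own illustration of the decision described above, not Roxen's actual code; the function name and structure are assumptions:

```python
def choose_encoding(page, forced=None):
    """Pick an output encoding as described above: honor an encoding
    explicitly requested by a data source, otherwise prefer ISO-8859-1
    and fall back to UTF-8 when the page doesn't fit."""
    if forced is not None:
        return page.encode(forced), forced
    try:
        # Latin 1 only covers code points U+0000..U+00FF.
        return page.encode("iso-8859-1"), "iso-8859-1"
    except UnicodeEncodeError:
        return page.encode("utf-8"), "utf-8"

_, charset = choose_encoding("smörgåsbord")   # fits in Latin 1
assert charset == "iso-8859-1"
_, charset = choose_encoding("Grüße, 世界")    # needs UTF-8
assert charset == "utf-8"
```

The try-then-fall-back shape is the whole point: the narrow, compact encoding is attempted first and the promotion to UTF-8 only happens when the content forces it.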
Not so smart...
Nonetheless, at this point we run into a tricky issue: setting an HTTP content-type header isn't always a perfect way of telling the client about the data representation. Various types of output have their own internal content-type directives, such as <meta> tags in HTML and the encoding attribute in an <?xml?> declaration.
For instance, there is a risk that a data source inside Roxen delivers XML data that declares itself to be UTF-8 while the HTTP protocol layer identifies that the string can in fact fit in the narrower and more compact ISO-8859-1 and subsequently switches representation and introduces a conflicting HTTP content-type header. Now what should the client do?
Browser bugs, or at least unexpected browser behavior, are another concern that I hinted at earlier. For example, I have seen frames rendered in Firefox inherit the character encoding from the surrounding frameset unless the frames are accompanied by explicit encoding directives. Contributing to this problem is that prior to version 5.0, Roxen treated "text/html; charset=iso-8859-1" as equivalent to "text/html" and thus didn't bother spelling it out in long form. After this discovery we've expanded the header to avoid the frameset issue, even though the old practice wasn't technically a violation of the specs.
What you can do
There are several ways to get back on track with Unicode in Roxen, and the simplest one is <charset out="utf-8" />. This RXML tag tells the HTTP protocol layer to select UTF-8 as the output encoding for the current request. Since UTF-8 can represent all Unicode characters it will never be automatically promoted to something else before the result is delivered. (In comparison, selecting e.g. ISO-8859-15 wouldn't be as safe since it doesn't share that property with UTF-8.)
With that directive in place, the next step is to make sure you don't introduce conflicting directives in the data. Either remove any <?xml?> declaration or sync it with this choice. One easy way to create a matching declaration is to use <maketag> in RXML, so taken together you'd get something like this:
<charset out="utf-8" />
<maketag type="pi" name="xml">version="1.0" encoding="utf-8"</maketag>
To remove an auto-generated <?xml?> declaration in XSLT transformations you add <xsl:output> to your template:
<xsl:output omit-xml-declaration="yes" />
While on the subject of browser quirks it's appropriate to point out that web pages with <form> elements deserve special attention. If the page itself doesn't contain Unicode content, some older browsers will have trouble posting form data that uses Unicode. It's common wisdom that forcing the page to be UTF-8 improves the chances of form data surviving the trip back to the server unharmed. Additionally, Roxen's decoding of form posts can be assisted by introducing a hidden magic variable in the page. This variable contains a known collection of Unicode characters, and by determining exactly how it was damaged in the delivery/post roundtrip, Roxen can apply compensating measures and hopefully recover the original Unicode data. The magic variable is inserted with the <roxen-automatic-charset-variable /> tag anywhere inside the <form> element.
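To see how such a probe can work in principle, here is a Python sketch. The probe characters, the candidate encoding list and the function are my own illustration of the general idea, not Roxen's actual algorithm:

```python
PROBE = "åäö€"  # known Unicode characters carried in the hidden variable

def guess_form_encoding(received_probe):
    """Try candidate encodings until the received probe bytes decode
    back to the original string; the rest of the form data can then
    be decoded with the same encoding."""
    for enc in ("utf-8", "windows-1252", "iso-8859-1"):
        try:
            if received_probe.decode(enc) == PROBE:
                return enc
        except UnicodeDecodeError:
            pass
    return None  # probe too damaged to recover

# A browser that posted the form as UTF-8 versus one using Windows-1252:
assert guess_form_encoding(PROBE.encode("utf-8")) == "utf-8"
assert guess_form_encoding(PROBE.encode("windows-1252")) == "windows-1252"
```

The probe has to be chosen so that its byte sequences differ between the candidate encodings; that's what makes the damage diagnosable.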
Improvements in Roxen CMS 5.0
Aside from the more detailed HTTP content-type header mentioned earlier, this version allows XSLT templates to use the <xsl:output> statement to automatically achieve the same effect as <charset out="..." />. This means that unless you prefix the output with more data during RXML processing, the result of the XSLT transformation can include an <?xml?> declaration with a valid encoding that Roxen's HTTP protocol module obeys. For instance:
<xsl:output media-type="application/rss+xml" encoding="utf-8" />
will generate the following at the beginning of the page:
<?xml version="1.0" encoding="utf-8"?>
together with the HTTP header:
Content-Type: application/rss+xml; charset=utf-8
Another aspect that we've tried to correct in Roxen CMS 5.0 is the previous versions' bad habit of putting <?xml version="1.0"?> headers in XML and XSL files without a corresponding encoding attribute. According to the XML specification, a missing encoding attribute implies UTF-8. Strictly speaking there is an exception that lets applications interpret it differently if necessary, but if such a file ever escapes the Roxen environment this interpretation is of course hard to enforce.
One problematic situation in particular was editing via Roxen Application Launcher. Starting in CMS 5.0 we try to add a truthful <?xml?> declaration as soon as possible to help external applications handle the source files correctly. Consequently, it's now very important that you don't enter conflicting data in your files since Roxen sniffs for the <?xml?> declaration during saves and uses it to decode the uploaded file. In the same way we also look for BOMs (Byte-Order Marks) when saving plain text content to support data that isn't XML-based. (Note though that BOMs are currently never generated by Roxen internally, just recognized when provided by external editors.)
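The sniffing described above, simplified to its essentials, might look like this Python sketch. This is assumed logic for illustration, not Roxen's implementation:

```python
import codecs
import re

def sniff_encoding(raw):
    """Guess a saved file's encoding from a BOM or an <?xml?>
    declaration, falling back to UTF-8 as the XML spec implies."""
    for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            return enc
    m = re.match(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', raw)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"

assert sniff_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?>') == "iso-8859-1"
assert sniff_encoding(codecs.BOM_UTF8 + b"plain text") == "utf-8"
assert sniff_encoding(b"no declaration at all") == "utf-8"
```

Note that the detection operates on raw bytes, before any decoding: this is exactly why a file whose <?xml?> declaration contradicts its actual byte encoding will be decoded incorrectly, hence the warning above about conflicting data.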
I should also add that the web-based source code editor (i.e. the <textarea> input field) always works in Unicode. In fact, you can now trigger an internal recoding of data in the repository just by editing the <?xml?> declaration in the first line of the file. Such a recoding action is non-destructive since it will substitute entities if the chosen encoding is too limited.
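That kind of non-destructive recoding is easy to illustrate in Python, where the standard xmlcharrefreplace error handler performs exactly this entity substitution (the example characters are my own, not actual repository content):

```python
text = "Delta: \u0394 and e-acute: é"

# Recoding to Latin 1: é fits, but U+0394 does not, so it is stored
# as a numeric character reference instead of being destroyed.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
assert encoded == b"Delta: &#916; and e-acute: \xe9"

# An XML parser reading the file later expands &#916; back to the
# original character, so no information is lost.
```

This is what makes it safe to recode a repository file into a narrower encoding: characters that don't fit survive as entities rather than being replaced or dropped.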
A related change is that newly created CMS repositories now default to UTF-8 for all language forks. A long-standing restriction is that once a repository has been taken into use, the encoding of a particular language fork cannot be changed retroactively. By switching the preferred encoding now we are better prepared for future improvements in this area.
Just the beginning
The transition to Unicode has merely started, and because of all the different software components involved (applications, file systems, databases etc.) it will likely be a bumpy ride for many years. Still, the direction is clear, and with Roxen CMS 5.0 we take yet another important step, not only as presented here but also by including a Unicode-capable MySQL server in the package.