Roxen CMS and XHTML

2008-09-01, 11:10 by Jonas Wallden in Development

Welcome!

First, a warm welcome to all readers of our new Planet web site! My colleagues and I will try to blog regularly on topics that relate to Roxen and the web in general. I've already collected some ideas for tutorials and tricks covering RXML features and CMS development that I want to talk about here.

XHTML is just like HTML but better...?

In this first posting I'd like to begin with a subject that our support team has been asked about quite frequently of late, namely how to approach XHTML for web use.

Before we get to XHTML, let's look at what we already have: XML and HTML. Since the dawn of the web, HTML has been the language that web pages are written in, and even though today's version (HTML 4.01) is far more advanced than what Tim Berners-Lee created, the same general concepts that were true back then still hold. For instance, the same web page can be expressed in many different ways in HTML since the markup allows for implicit end tags, case-insensitive tag and attribute names and so on. Combine this with web browsers, old and new, that aren't particularly strict about reporting errors but instead try their best to recover and show at least something useful rather than an error dialog. Over time the situation has improved steadily (thanks to DOCTYPE declarations that trigger standards compliance), but new browsers must nevertheless emulate bugs and duplicate behavior of older versions that people have come to rely on.

Now, with XML the web developer community has been given a chance to start with a clean slate. Aside from the fact that XML enables you to separate content from layout (you define your own tags etc.), it also removes the parser ambiguities that HTML suffers from. XML tools are in fact required to abort processing if they encounter malformed input instead of covering up somebody's mistake and wandering off into the woods...

So where does XHTML come into this? It's basically HTML expressed in XML syntax. It sounds simple but has great benefits: XHTML can be created and processed with any XML tool since it is genuine XML. It inherits all the advantages that XML has (no more sloppy treatment of errors and so on) while maintaining its purpose as a web page language that looks familiar to any web developer. Your <h1> or <p> tags will still be called <h1> and <p> in XHTML.

So we'll all switch to XHTML, right?

That would be a natural conclusion given what I wrote above, but we still have the browsers to consider. Even today's most popular browsers (Microsoft Internet Explorer 6 and 7) have no support at all for rendering pure XHTML content. Safari and Firefox are not much better.

This was anticipated by the authors of the W3C XHTML 1.0 specification. In its first version they allowed for a compromise where XHTML can still be served as text/html so that browsers would see it as old-style HTML and try their best to render the pages. XHTML version 1.1 requires that application/xhtml+xml is used instead but that's of even less use as long as browsers aren't aware of it.

The practice of serving XHTML to a browser that only understands HTML is the real reason why developers end up in trouble. The browser may seemingly handle simple pages, so the developer incorrectly assumes that it's an acceptable solution for more complex web sites as well. Wrong! Pages will break in subtle, or not so subtle, ways. The first sign on Roxen CMS sites is normally that tags with empty content (such as <br> or <img>) get misinterpreted. You can't blame Roxen for serving XHTML in accordance with the XML spec, where this syntax is valid; if you asked for XML, be prepared to really get XML.

Fixes and "fixes"

If you've followed along all the way here you also deserve some practical hints!

The primary solution remains to generate HTML. <xsl:output> with a method attribute is all you need in your server-side XSLT stylesheets. Adding a DOCTYPE is a natural next step, which ensures standards-compliant parsing and avoids the quirks mode that many browsers otherwise fall back to. Here's an example of how to accomplish this:

  <xsl:output
          method="html"
          doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"
          doctype-system="http://www.w3.org/TR/html4/loose.dtd" />
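
For context, here is a minimal sketch of a complete stylesheet around that declaration. The page content and the example.js file name are made up for illustration; only the placement of <xsl:output> is the point. With the html method, XSLT 1.0 processors generally write <br> without the trailing slash and keep the end tag on an empty <script>, although the exact serialization can differ between processors.

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Must be a top-level element, directly under xsl:stylesheet. -->
    <xsl:output
            method="html"
            doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"
            doctype-system="http://www.w3.org/TR/html4/loose.dtd" />

    <!-- Illustrative only: a static page built for the document root. -->
    <xsl:template match="/">
      <html>
        <head>
          <title>Example</title>
          <!-- Hypothetical file name; typically written out as
               an opening tag followed by a separate end tag. -->
          <script type="text/javascript" src="example.js" />
        </head>
        <body>
          <!-- The br element is typically written out as a plain
               HTML br tag with no slash. -->
          <p>First line<br />second line</p>
        </body>
      </html>
    </xsl:template>

  </xsl:stylesheet>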

When this isn't an option you have to accept that browsers will be more or less successful in displaying your XHTML pages. Most important is to avoid empty elements:

  • Roxen CMS will close empty elements as <br />. Note the extra space before the slash which should help most browsers detect this as a regular <br> tag.
  • Tags that can be either empty or contain data are trickier. One example is <script>, which may reference an external resource or have inline code. The former variant has no content but still needs a closing tag, and Roxen CMS will condense it into <script /> unless you add some trivial markup inside. One that we've found works well in this and other places is <span />. If you write <script src="..."><span /></script> you will preserve the ending tag even in XHTML.
  • Watch out for <div /> and <ul /> and similar container tags that end up empty. This will easily break the page structure since browsers only see them as opening tags that need to be balanced further down. If that happens your DOM tree and CSS patterns will most likely be wrong. If you generate such containers dynamically you can either check prior to outputting the container or use the <span /> placeholder as mentioned above:

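    <xsl:variable name="some-li-nodes">
      <xsl:call-template name="..." />
    </xsl:variable>
    <!-- In XSLT 1.0 a variable with content is a result tree fragment,
         which is always true as a boolean, so test its string value
         instead (this assumes the list items contain some text). -->
    <xsl:if test="string($some-li-nodes)">
      <ul><xsl:copy-of select="$some-li-nodes" /></ul>
    </xsl:if>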
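
The other option, the <span /> placeholder, could be sketched roughly like this. The template name safe-div, its content parameter and the item elements are all hypothetical names chosen for the example, not part of Roxen CMS. The helper always outputs the container and adds the placeholder only when there is nothing else to put inside.

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Hypothetical helper: wraps $content in a <div> and adds the
         <span /> placeholder when $content has no text at all. -->
    <xsl:template name="safe-div">
      <xsl:param name="content" />
      <div>
        <xsl:copy-of select="$content" />
        <!-- string() of an empty fragment is '', which counts as false.
             Content holding only empty elements also gets the
             placeholder, which is harmless. -->
        <xsl:if test="not(string($content))"><span /></xsl:if>
      </div>
    </xsl:template>

    <!-- Example call; the "item" child elements are made up. -->
    <xsl:template match="/*">
      <xsl:call-template name="safe-div">
        <xsl:with-param name="content">
          <xsl:apply-templates select="item" />
        </xsl:with-param>
      </xsl:call-template>
    </xsl:template>

  </xsl:stylesheet>

Compared to checking before output, this keeps the container in the page even when it is empty, which matters if your CSS or scripts expect it to be there.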

But why can't Roxen just output what I enter in my files?

Definitely a valid question! If you enter <script></script>, why does Roxen CMS have to change this into <script />? Very simple: it has no memory of what you entered. When the XSLT engine starts processing XML data, the data is converted into an internal tree form. In this representation there is no semantic difference between the two forms, so the distinction that was apparent in the input file is lost. Storing and managing such information would simply waste memory and processing power when it serves no purpose.
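
One way to see this for yourself is an identity transform, which copies every node unchanged. The sketch below is generic XSLT 1.0 rather than anything Roxen-specific, and the exact serialization depends on the XSLT processor. Feed it a document containing <script src="..."></script> and, with the xml output method, you will typically get the collapsed empty-element form back.

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" />

    <!-- Identity rule: copy each attribute and node as it is found in
         the tree. Nothing here can tell the two ways of writing an
         empty script element apart, since both parse to an element
         with no children. -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
    </xsl:template>

  </xsl:stylesheet>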

I hope this has clarified concerns about why XHTML in Roxen may not work as you initially expect. I look forward to your comments.

– Jonas Walldén, CTO

 

1   Paul Kok

2008-10-20 13:47

"Watch out for <div /> and <ul />"

This is a very important part, as we found out. The suggested option ("check prior to outputting") is nice of course, but if you still want an 'empty' <div> in your code (for example, because you use it as a divider) put some 'empty' content in it like: <div>&shy;</div>. The &shy; character is displayed as nothing in most browsers.

Also note that the default 4.5 site contains some of these 'bugs'! Examples are <xsl:template name="component-spacing"/> and <xsl:template name="roxen-edit-box"/>. You have to override these to get the default site working.

Good luck!

