
Posted by
Pontus Östlund  -  July 2010
Howdy folks!

Say I have a JSON string where our Swedish characters have been replaced with Unicode escapes like "\u00e5" etc., so the string itself is plain ASCII. I want to replace all "\uXXXX" sequences with the proper Swedish characters, and this is the solution I came up with. But the question is: is there a better way to do this?

[PIKE]
string decode_unicode_chars(string s)
{
  // Collect the hex digits of every \uXXXX escape in the string.
  sscanf(s, "%{%*s\\u%4[0-9a-fA-F]%}", array m);
  mapping used = ([]);
  foreach (m[*][0], string hex) {
    // Only do the replace once per distinct escape.
    if ( used[hex] ) continue;
    sscanf(hex, "%x", int char);
    s = replace(s, "\\u"+hex, sprintf("%c", char));
    used[hex] = 1;
  }

  return s;
}

decode_unicode_chars("L\\u00f6k p\\u00e5 laxen");
//> Lök på laxen
[/PIKE]

So, can I do this in a better manner?
 
Posted by
Martin Stjernholm  -  July 2010
Pike uses the same escapes in its string literals, so you could use the Pike compiler:

> compile("constant s = \"L\\u00f6k p\\u00e5 laxen\";")->s;
(1) Result: "L\366k p\345 laxen"

(My hilfe escapes 8-bit chars on output, but our proud Swedish chars really are there verbatim in the result.)
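
A rough sketch of how you could wrap that up (the function name and the naive quoting of double quotes are just for illustration; newlines or already-escaped quotes in the input would need more care):

[PIKE]
string decode_via_compiler(string s)
{
  // Escape plain double quotes so the generated string literal stays
  // well formed. Backslash escapes such as \u00e5 are deliberately left
  // alone, since interpreting them is the compiler's job.
  string quoted = replace(s, "\"", "\\\"");
  return compile("constant s = \"" + quoted + "\";")->s;
}
[/PIKE]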

Since the compiler has a bit of startup overhead I'm not really sure it's faster for short strings, though.

I was first going to suggest the %O flag to sscanf, but unfortunately it turns out it doesn't support \u escapes (it ought to, I think).

Btw, speaking of JSON, you might be interested to know that Pike 7.8 recently got a Standards.JSON module that sports both an encoder and a decoder. At this time you need the CVS version to get it, though.
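
For reference, a hedged sketch of what using it could look like (the exact calls are from memory, so check the module once you have it):

[PIKE]
mixed data = Standards.JSON.decode("{ \"greeting\": \"L\\u00f6k p\\u00e5 laxen\" }");
// data->greeting is now the decoded wide string "Lök på laxen".
[/PIKE]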
 
Posted by
Pontus Östlund  -  July 2010
Thanks Martin!

I don't expect the JSON strings to be humongous, so I might stick with my solution since it seems to work, or I'll make my JSON parser handle \u escapes directly.
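
Something along these lines is roughly what I have in mind for the parser (just a sketch; the function name is made up):

[PIKE]
// Decode the four hex digits at position pos (just after a "\u") in s.
int decode_u_escape(string s, int pos)
{
  int char;
  sscanf(s[pos..pos+3], "%x", char);
  return char;
}

// sprintf("%c", decode_u_escape("L\\u00f6k", 3));
//> "ö"
[/PIKE]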

Yes, I saw that a JSON parser is on its way into Pike. That's nice :)
I'll wait for it to become officially bundled and stick with my Pike solution until then.
 