Writing a UTF library in scala
Mon, 2010-12-27, 00:24
I'm having trouble deciding how to design a Unicode library to handle invalid UTF. Therefore, please give me your thoughts.
Converting individual code points:
==========================
Case 1: Attempting to convert a single code point from UTF-8 or UTF-16 to a UTF-32 code point. Options include:
a) having return type Option[Utf32Char]
b) having return type Utf32Char and returning the substitution character "�" on error, or
c) having return type Utf32Char and throwing an IllegalArgumentException on error.
Multiple options could be implemented.

Case 2: Attempting to convert a code point from UTF-32 to UTF-8 (or UTF-16). Same options:
a) having return type Option[Array[Utf8Char]]
b) having return type Array[Utf8Char] and returning the encoding for the substitution character "�" on error, or
c) having return type Array[Utf8Char] and throwing an IllegalArgumentException on error.

Converting strings code points:
==========================
Case 3: Decoding a Utf8String (or Utf16String) to Utf32String:
a) having return type Option[Utf32String]
b) having return type Utf32String and substituting "�" for each invalid code unit in the input, or
c) having return type Utf32String and throwing an IllegalArgumentException on error.

Case 4: Encoding a Utf32String as Utf8String (or Utf16String):
a) having return type Option[Utf8String]
b) having return type Utf8String and substituting "�" for each invalid code unit in the input, or
c) having return type Utf8String and throwing an IllegalArgumentException on error.

Thanks,
Arya
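For concreteness, the three options in Case 1 could be sketched roughly as follows for UTF-8 input. This is only an illustration, not the library's code: the object and method names are made up here, and a plain Int stands in for the Utf32Char type mentioned above. Options (b) and (c) are derived from the Option-returning version.

```scala
object Utf8Decode {
  // U+FFFD REPLACEMENT CHARACTER, used for option (b)
  val Replacement: Int = 0xFFFD

  /** Option (a): the code point starting at bytes(i), or None if the sequence is malformed. */
  def decodeOption(bytes: Array[Byte], i: Int): Option[Int] =
    if (i < 0 || i >= bytes.length) None
    else {
      val b0 = bytes(i) & 0xFF
      // (continuation byte count, payload bits of the lead byte, smallest non-overlong value)
      val shape: Option[(Int, Int, Int)] =
        if (b0 < 0x80) Some((0, b0, 0))
        else if ((b0 & 0xE0) == 0xC0) Some((1, b0 & 0x1F, 0x80))
        else if ((b0 & 0xF0) == 0xE0) Some((2, b0 & 0x0F, 0x800))
        else if ((b0 & 0xF8) == 0xF0) Some((3, b0 & 0x07, 0x10000))
        else None // stray continuation byte or invalid lead byte
      shape.flatMap { case (extra, lead, min) =>
        val tail = bytes.slice(i + 1, i + 1 + extra).map(_ & 0xFF)
        // every expected continuation byte must be present and look like 10xxxxxx
        val wellFormed = tail.length == extra && tail.forall(b => (b & 0xC0) == 0x80)
        val cp = tail.foldLeft(lead)((acc, b) => (acc << 6) | (b & 0x3F))
        val isSurrogate = cp >= 0xD800 && cp <= 0xDFFF
        // reject overlong forms, values above U+10FFFF, and encoded surrogates
        if (wellFormed && cp >= min && cp <= 0x10FFFF && !isSurrogate) Some(cp) else None
      }
    }

  /** Option (b): substitute U+FFFD on error. */
  def decodeOrReplace(bytes: Array[Byte], i: Int): Int =
    decodeOption(bytes, i).getOrElse(Replacement)

  /** Option (c): throw IllegalArgumentException on error. */
  def decodeOrThrow(bytes: Array[Byte], i: Int): Int =
    decodeOption(bytes, i).getOrElse(
      throw new IllegalArgumentException("malformed UTF-8 at index " + i))
}
```

Deriving (b) and (c) from (a) keeps the validation logic in one place. Whether the extra Option allocation matters is a separate question that this sketch does not try to answer.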
Mon, 2010-12-27, 04:57
#2
Re: Writing a UTF library in scala
Conversion code already exists in Java; see, e.g.,
http://download.oracle.com/javase/6/docs/api/java/nio/charset/CharsetDec...
The Java methods can {ignore,replace,report} {malformed input,
unmappable characters}. Reports result in CharacterCodingExceptions
of the appropriate type. ISTM that, for interop, we should be using
Java's methods. What to do -- ignore, replace, or report -- for
malformed input and for unmappable characters can be passed into Scala
conversion methods with default arguments. The question is what the
default should be. Since ignore and replace are silent, I think
the default should be an exception, letting the caller specify any
explicit intent to ignore or replace.
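A minimal sketch of that suggestion, assuming UTF-8 input held in a byte array: the wrapper name decodeUtf8 and its parameter names are invented for illustration, while the java.nio.charset calls are the standard ones. Both actions default to REPORT, so a caller has to opt in to silent handling.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

object JavaDecode {
  /** Decodes UTF-8 bytes via java.nio.charset; errors are reported unless the caller says otherwise. */
  @throws(classOf[CharacterCodingException])
  def decodeUtf8(
      bytes: Array[Byte],
      onMalformed: CodingErrorAction = CodingErrorAction.REPORT,
      onUnmappable: CodingErrorAction = CodingErrorAction.REPORT
  ): String =
    StandardCharsets.UTF_8
      .newDecoder()
      .onMalformedInput(onMalformed)       // what to do with byte sequences that are not valid UTF-8
      .onUnmappableCharacter(onUnmappable) // what to do with input that has no mapping
      .decode(ByteBuffer.wrap(bytes))      // throws CharacterCodingException under REPORT
      .toString
}
```

With the defaults, decodeUtf8(Array(0x61, 0xFF, 0x62).map(_.toByte)) throws. Passing onMalformed = CodingErrorAction.REPLACE yields "a", U+FFFD, "b" instead, and CodingErrorAction.IGNORE drops the bad byte.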
Mon, 2010-12-27, 05:27
#3
Re: Writing a UTF library in scala
Oh hey, cool.
But... java.nio.charset allows us to convert between:
US-ASCII (aka ISO646-US), ISO-8859-1 (aka ISO-LATIN-1), UTF-8, UTF-16BE, UTF-16LE, and UTF-16 w/ BOM
to/from java.nio.CharBuffer (or to Array[Char], *if* your instance of CharBuffer supports that *optional* method...).
You could subclass java.nio.charset.Charset to support UTF-32BE, UTF-32LE, and UTF-32 w/ BOM, but you still wouldn't be able to convert between UTF-8 and UTF-32 without first converting to/from UTF-16 each time. (Something I was trying to get away from.)
I like the design of java.nio.charset (new to me), but it does seem to be directed at serialization, rather than in-memory, algorithmic stuff. (That's also probably why there's no support for UTF-32 in that package.)
-Arya
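To make the intermediate step concrete, here is a rough sketch of the two-hop route through UTF-16, using only standard java.lang.String and Character methods. The object and method names are hypothetical, and Array[Int] stands in for a UTF-32 string.

```scala
import java.nio.charset.StandardCharsets

object ViaUtf16 {
  /** UTF-8 bytes -> String (UTF-16 code units) -> array of code points. */
  def utf8ToCodePoints(bytes: Array[Byte]): Array[Int] = {
    // Hop 1: UTF-8 to UTF-16. This constructor silently replaces malformed input with U+FFFD.
    val s = new String(bytes, StandardCharsets.UTF_8)
    val out = new Array[Int](s.codePointCount(0, s.length))
    var i = 0 // position in UTF-16 code units
    var n = 0 // position in code points
    while (i < s.length) {
      // Hop 2: combine surrogate pairs into a single code point.
      val cp = s.codePointAt(i)
      out(n) = cp
      n += 1
      i += Character.charCount(cp) // 1 for BMP characters, 2 for supplementary ones
    }
    out
  }
}
```

Every conversion allocates an intermediate String, which is the overhead a direct UTF-8 to UTF-32 path would avoid.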
Mon, 2010-12-27, 13:07
#4
Re: Writing a UTF library in scala
Yes. On the other hand, Java doesn't have Option, and its emphasis on using exceptions to return information is not appreciated by all (to put it mildly). A Scala library cannot choose between a return type Option[UTF8Char] and UTF8Char based on the value of a parameter.
--
Daniel C. Sobral
I travel to the future all the time.
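Because a method's return type is fixed at compile time, one way to offer both styles is two entry points over the same conversion. The sketch below assumes UTF-8 input and uses hypothetical names (TwoEntryPoints, decode, decodeOption); only the java.nio.charset calls are real API.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

object TwoEntryPoints {
  /** Java style: throws CharacterCodingException on malformed input. */
  def decode(bytes: Array[Byte]): String =
    StandardCharsets.UTF_8
      .newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
      .decode(ByteBuffer.wrap(bytes))
      .toString

  /** Same conversion, but failure is part of the return type. */
  def decodeOption(bytes: Array[Byte]): Option[String] =
    try Some(decode(bytes))
    catch { case _: CharacterCodingException => None }
}
```

The caller picks the error style by picking the method, not by passing a flag.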
Mon, 2010-12-27, 18:57
#5
Re: Writing a UTF library in scala
On Sun, Dec 26, 2010 at 6:23 PM, Arya Irani <arya.irani@gmail.com> wrote:
> I'm having trouble deciding how to design a Unicode library to handle invalid UTF. Therefore, please give me your thoughts.
>
> Converting individual code points:
> ==========================
> Case 1: Attempting to convert a single code point from UTF-8 or UTF-16 to a UTF-32 code point. Options include:
> a) having return type Option[Utf32Char]
> b) having return type Utf32Char and returning the substitution character "�" on error, or
> c) having return type Utf32Char and throwing an IllegalArgumentException on error.
Usually, you won't do this expecting it to fail. Thus, you should plan for it to not fail. Thus, it should be an exception--full performance unless something goes wrong, which it usually won't. (Unless you try to read random binary data into UTF-32.)
> Case 2: Attempting to convert a code point from UTF-32 to UTF-8 (or UTF-16). Same options:
Same deal.
> Converting strings code points:
> ==========================
> Case 3: Decoding a Utf8String (or Utf16String) to Utf32String:
> Case 4: Encoding a Utf32String as Utf8String (or Utf16String):
Same deal.
--Rex
Tue, 2010-12-28, 08:27
#6
Re: Writing a UTF library in scala
I put up some under-tested and under-documented code at https://github.com/refried/scala-utf/ and will be updating it periodically. It doesn't implement LinearSeqOptimized yet, let alone ropes. Also, it's my first time using git or GitHub, so I may not be doing it right.
Feedback would be welcomed from anyone!
Thanks,
Arya
On Mon, Dec 27, 2010 at 10:23 AM, Arya Irani wrote:
> Converting individual code points:
> ==========================
> Case 1: Attempting to convert a single code point from UTF-8 or UTF-16 to a
> UTF-32 code point. Options include:
>
> a) having return type Option[Utf32Char]
>
> b) having return type Utf32Char and returning the substitution character "�"
> on error, or
>
> c) having return type Utf32Char and throwing an IllegalArgumentException on
> error.
>
> Multiple options could be implemented.
The old "Option vs Exception" question. Whether to model the failure
case inside your app domain (using Option) or outside (with an
Exception) is a context-sensitive design choice. It depends on the
probability of failure (thus, data dependent), the app's desired
"quality-to-cost" trade-off (e.g., a quick script vs. an aircraft control
system), and whether a failure can actually be handled within the app.
I'd offer both (a) and (c) and let your users decide which they
prefer. Or if you only offer one, use Option, as pulling on None will
yield a runtime exception anyway.
-Ben
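As a small illustration of that last point, here is a sketch of option (a) for Case 2 in the UTF-16 direction. The names are hypothetical, Int stands in for a UTF-32 character and Array[Char] for the UTF-16 output; Character.isValidCodePoint and Character.toChars are standard Java methods.

```scala
object Utf16Encode {
  /** UTF-16 code units for cp, or None if cp is not a Unicode scalar value. */
  def encodeOption(cp: Int): Option[Array[Char]] = {
    val isSurrogate = cp >= 0xD800 && cp <= 0xDFFF
    // valid range is 0..0x10FFFF, excluding the surrogate block
    if (Character.isValidCodePoint(cp) && !isSurrogate) Some(Character.toChars(cp))
    else None
  }
}
```

A caller who wants exception behaviour can write Utf16Encode.encodeOption(cp).get, which throws NoSuchElementException on None. A caller who wants substitution can write Utf16Encode.encodeOption(cp).getOrElse(Array('\uFFFD')). Going the other way, from an exception-only API back to Option, needs a try/catch at each call site.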