
straightening out file.encoding

extempore

Since I've been wanting to create an easy mechanism for exerting fine
grained control over scala compiler and interpreter behavior (a list of
-X and -Y command line options gets unwieldy in a hurry) I am doing some
much needed cleanup of properties and settings code. One thing this
brought back to my attention is the situation with file.encoding.

Once upon a time, scala used the system file encoding, defaulting to
ISO-8859-1 if that was unknown:

https://lampsvn.epfl.ch/trac/scala/changeset/4078
[...] System.getProperty("file.encoding", "ISO-8859-1"))

At some point (which my git searching will not reveal for some reason)
this was changed, unintentionally I believe, in such a way as to ignore
the system encoding and always use UTF-8:

// now it looks like this, but props is the properties object
// based on the properties file in the jar, which does not define
// file.encoding, so it always uses UTF8 without checking System.
props.getProperty("file.encoding", "UTF8")
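The failure mode is easy to reproduce with plain java.util.Properties (the class and method names below are mine, for illustration only): an empty properties object never consults System, so the hard-coded default always wins.

```java
import java.util.Properties;

public class PropsDemo {
    // mirrors the buggy lookup: the properties object loaded from the jar
    // has no file.encoding key, so the default is returned unconditionally
    // and System.getProperty("file.encoding") is never consulted
    static String encodingFrom(Properties props) {
        return props.getProperty("file.encoding", "UTF8");
    }

    public static void main(String[] args) {
        System.setProperty("file.encoding", "ISO-8859-1");   // has no effect on the lookup below
        System.out.println(encodingFrom(new Properties()));  // UTF8
    }
}
```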

I opened a bug about this a while ago:

https://lampsvn.epfl.ch/trac/scala/ticket/1581

While performing the cleanup I discovered that while the library and
properties objects always use UTF8, partest always uses ISO-8859-1,
which seems unlikely to be intentional.

So, I propose to make the following changes simultaneously.

1) move all properties files into scala/
2) change partest to use the same encoding as everyone else
3) use the local file.encoding if it is present
3b) ... but! Apple's java reports file.encoding as MacRoman[1],
despite the fact that the default encoding on OS X is and
has always been UTF-8[2]. So I propose to add hacky but
unavoidable "ignore MacRoman and use UTF-8 on OS X" logic[3],
but use the system default if it's anything else.
4) If an encoding is passed on the command line with -encoding,
pass it to java with -Dfile.encoding so the JVM actually uses it.
As things stand, scala has to explicitly specify the value of
-encoding everywhere it is used (which is not a bad idea anyway)
but someone using java libs directly will end up using the default
encoding unless they ALSO set JAVA_OPTS=-Dfile.encoding=UTF8.
5) Since 4) requires the dread task of parsing in the startup script,
I should fix this bug as well:

https://lampsvn.epfl.ch/trac/scala/ticket/1222
"scala ignores -Djava.library.path"

I implemented this a while ago actually, passing any -D command
line options along to the java invocation unaltered, but I haven't
pursued committing it because I have no idea whether it'll work on
windows and, if it doesn't, what to do about it.

[1] scala -e 'println(System.getProperty("file.encoding"))' => MacRoman
[2] http://en.wikipedia.org/wiki/Mac_OS_Roman
[3] http://www.blakeramsdell.com/techblog/2006/06/10/unicode-is-tricky-in-ja...
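Item 3b) might be sketched roughly as follows (the helper name and the exact OS check are my guesses, not the actual patch):

```java
public class EncodingChoice {
    // sketch of item 3b): trust the system file.encoding, except when it is
    // the bogus MacRoman default that Apple's java reports on OS X
    static String chooseEncoding(String osName, String fileEncoding) {
        if (fileEncoding == null) return "UTF-8";
        if (osName.startsWith("Mac") && fileEncoding.equals("MacRoman")) return "UTF-8";
        return fileEncoding;
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("Mac OS X", "MacRoman"));  // UTF-8
        System.out.println(chooseEncoding("Linux", "ISO-8859-1"));   // ISO-8859-1
    }
}
```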

extempore
Re: straightening out file.encoding

There is perhaps a conspiracy against java users who would like to use
non-ascii charsets without endless pain. I just booted windows to see
what its default is, and on Windows XP with Sun's latest JVM it's
"Cp1252", which is about as promising a default as MacRoman:

http://en.wikipedia.org/wiki/Windows-1252

So I'm not entirely sure what to do about these suboptimal defaults.
However, I do think these two points are clear:

1) If the user passes -encoding on the command line to scala or
scalac, that encoding should be used everywhere.
2) If the user has explicitly set their system file.encoding to
anything, that encoding should be used, not UTF-8.

odersky
Re: straightening out file.encoding

On Tue, Mar 10, 2009 at 3:51 PM, Paul Phillips wrote:
> [...]
> While performing the cleanup I discovered that while the library and
> properties objects always use UTF8, partest always uses ISO-8859-1,
> which seems unlikely to be intentional.
>
Ah! That probably explains some weird problems I get from partest from
time to time.

> So, I propose to make the following changes simultaneously.
>
> [...]
>

It sounds OK to me, but we should see whether we can test this on all
major platforms (Windows, Linux, Mac) before it goes in. Stephane, who
knows much more about encodings than I do, might want to comment also.
He can also test it on Windows.

Cheers

extempore
default charsets and Source.from behavior

On Mon, Mar 16, 2009 at 12:49:14PM +0100, martin odersky wrote:
> It sounds OK to me, but we should see whether we can test this on all
> major platforms (Windows, Linux, Mac) before it goes in. Stephane, who
> knows much more about encodings than I do, might want to comment also.
> He can also test it on Windows.

[NOTE: I know this is totally not interesting. I couldn't agree more!
But before 2.8 is released is the time to fix this and then hopefully we
won't have to think about it anymore...]

I could still use some more feedback on encodings. I'd like to fix up
scala.io for 2.8 to at least be internally consistent, but this requires
a few decisions to be made.

A new issue is the one implicitly brought up in #1883:

https://lampsvn.epfl.ch/trac/scala/ticket/1883

You will notice iulian closed this because the source data was not
actually UTF-8 encoded, and indeed I discovered the same thing when
writing the NIO out of scala.io. However, I still think the bug is
valid, because to me it seems extremely undesirable that the default
behavior for reading some data via the ostensibly simple scala interface
can include throwing an exception which will kill me at runtime and
which it would never even occur to me to catch. And which might not be
thrown during testing no matter how much data I throw at it.

CharsetDecoders have a configuration method called "onMalformedInput"
which lets us specify what is to be done with an illegal byte sequence.
The default unfortunately is REPORT, which means it throws an obscure
exception. I would say the whole point of scala.io.Source is to shield
me from having to explicitly configure the charset decoder not to throw
an exception, so I propose to either change this default to REPLACE or
IGNORE, or not have any default and require users to specify it, so at
least they will be forced to be aware of it.

The behavior can still be overridden, and/or the replacement strategy
can be defined in subclasses - for the moment I have this implemented
like this:

// by default we replace bad chars with the decoder's replacement value (e.g. "?")
// this behavior can be altered by overriding these two methods
def malformedAction(): CodingErrorAction = CodingErrorAction.REPLACE
def receivedMalformedInput(e: Exception): Char = decoder.replacement()(0)

decoder.onMalformedInput(this.malformedAction)
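In plain Java terms, the REPLACE behavior looks like this (an illustrative helper; note that the stock UTF-8 decoder's replacement is U+FFFD rather than "?"):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class LenientDecode {
    // decode bytes as UTF-8, substituting the decoder's replacement
    // character (U+FFFD for UTF-8) instead of throwing on bad input
    static String decodeLenient(byte[] bytes) throws Exception {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer out = decoder.decode(ByteBuffer.wrap(bytes));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xC3 opens a two-byte UTF-8 sequence that 'b' does not complete
        byte[] bad = { 'a', (byte) 0xC3, 'b' };
        System.out.println(decodeLenient(bad));  // "a\uFFFDb"
    }
}
```

With CodingErrorAction.REPORT (the default) the same call would instead throw MalformedInputException.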

The reason I think changing the default is particularly important goes
back to this:

http://thread.gmane.org/gmane.comp.lang.scala.internals/189

I think the route to the most consistent and reasonable behavior is to
use the system file.encoding if it has been set, UNLESS it is the
default as reported by the operating system on macs and windows
(MacRoman and Cp1252, respectively) in which case to use UTF-8. For
sure not using UTF-8 is wrong on the mac; I admit I am much less clear
about windows.

So if UTF-8 is to be the default then I do not think it is at all
acceptable for file I/O to be program-destroyingly unforgiving. If you
enjoy having outright program failure on the slightest deviation from
spec, you can always use XML - that's its specialty.

Messed up fact of the day: apple has apparently modified their JVM to
observe runtime changes to file.encoding and have Charset.defaultCharset
change its return value. The openjdk source says defaultCharset is set
the first time it's called and never changed, and indeed that's how
java7 acts on my mac and how java6 works on windows.

// openjdk implementation
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        [ lazy init elided ]
    }
    return defaultCharset;
}

And yet:

... (Java HotSpot(TM) Client VM, Java 1.5.0_16)
scala> java.nio.charset.Charset.defaultCharset
res0: java.nio.charset.Charset = MacRoman

scala> System.setProperty("file.encoding", "UTF-8")
res1: java.lang.String = MacRoman

scala> java.nio.charset.Charset.defaultCharset
res2: java.nio.charset.Charset = UTF-8

(Try that on any other JVM and I expect you'll see no change, though who
knows who else has customized it.)

I notice apple's current implementation of java6 acts the same as the
rest of the world, so I think they've decided that wasn't a good idea.

Ignoring apple's attempt at helpfulness, the only way to see consistent
behavior is to pass -Dfile.encoding to the java invocation, setting it
before scala is running. Otherwise there's a major league
non-determinism bug, because if any method anywhere is called which
invokes defaultCharset before the file.encoding property is set (and
there are core java methods which use it) then it will embed the wrong
default for the lifetime of the JVM.
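The caching can be demonstrated directly on a stock JVM (Apple's 1.5 VM, as shown above, is the odd one out):

```java
import java.nio.charset.Charset;

public class DefaultCharsetCaching {
    // on stock JVMs the default charset is computed once and cached, so a
    // runtime change to file.encoding never reaches Charset.defaultCharset
    static boolean survivesPropertyChange() {
        Charset before = Charset.defaultCharset();
        System.setProperty("file.encoding", "ISO-8859-1");
        return before.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println(survivesPropertyChange());  // true
    }
}
```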

milessabin
Re: default charsets and Source.from behavior

As far as file I/O is concerned, I really think we should revert to
doing *exactly* the same thing as Java.

For network I/O there are no sensible defaults because the encoding is
determined by the remote system. I'm 100% in favour of forcing
explicit specification of the encoding and reporting of mismatches
between expected and actual encoding rather than silently translating
or ignoring.

Cheers,

Miles

loverdos
Re: default charsets and Source.from behavior
Hi Paul,
The fact that the URL you mention can be parsed as ISO-8859-1 but not as UTF-8 means that it contains bytes above ASCII 127 that do not form valid UTF-8 sequences. This is perfectly fine.
For an http request with rich headers, I believe the proper way would be to inspect them and see if a charset can be figured out (basically what browsers do). The page mentioned in the ticket does not give a clue about its encoding, I just did a <wget -S>.
IMHO, the ticket does not hold, simply because you are referring to an external source and you cannot force it to some encoding without first consulting any metadata it may have (http headers in this respect).

BR,
Christos

extempore
Re: default charsets and Source.from behavior

On Wed, Apr 22, 2009 at 06:53:30PM +0300, Christos KK Loverdos wrote:
> For an http request with rich headers, I believe the proper way would
> be to inspect them and see if a charset can be figured out (basically
> what browsers do). The page mentioned in the ticket does not give a
> clue about its encoding, I just did a <wget -S>.

Do people think it's sensible to implement some level of encoding
guessing logic, or is this madness? I don't think it'd be unreasonable
since it's apparent to me there are no defaults which work as desired in
all circumstances, and it's already the case that people who process
character data without specifying the encoding are gambling on the
results.

I'm not real keen on it becoming my life's work, though; I am only
dealing with encodings at all out of a general desire for robustness.

> IMHO, the ticket doesnot hold, simply because you are referring to an
> external source and you cannot force it to some encoding without first
> consulting any metadata it may have (http headers in this respect).

But any data can be processed as if it were any encoding, if you don't
mind getting a '?' where the invalid bytes reside. The question is more
one of whether one would usually prefer '?'s or runtime exceptions,
especially in the case where one didn't specify an encoding (assuming we
continue to allow that case -- but we can only remove it from scala, we
can't remove it from java...)
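For illustration, ISO-8859-1 assigns a character to every one of the 256 byte values, so decoding with it can never fail; the only question is whether the result means anything.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Latin1Demo {
    // ISO-8859-1 maps all 256 byte values to characters, so this never
    // throws regardless of what the input bytes actually encode
    static String decodeLatin1(byte[] bytes) {
        return Charset.forName("ISO-8859-1").decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] arbitrary = { 0x48, (byte) 0xC3, (byte) 0xFF, 0x00 };
        System.out.println(decodeLatin1(arbitrary).length());  // 4: one char per byte
    }
}
```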

milessabin
Re: default charsets and Source.from behavior

On Wed, Apr 22, 2009 at 5:34 PM, Paul Phillips wrote:
> Do people think it's sensible to implement some level of encoding
> guessing logic, or is this madness?

It's madness ... trust me, this stuff has been done to death over the
years, and the only sane thing to do is to have explicit labelling
(eg. via HTTP headers).

For an example of the kind of contortions you get into sniffing
encodings even for predictable inputs like XML documents take a look
at,

http://www.w3.org/TR/REC-xml/#sec-guessing

In general you can't assume any particular content type so it's really
completely hopeless.

Cheers,

Miles

Alex Cruise
Re: default charsets and Source.from behavior

Paul Phillips wrote:
> Do people think it's sensible to implement some level of encoding
> guessing logic, or is this madness?
Madness! :) (Please, no Sparta comments...)

XML parsers are among the few well-known classes of software that can
reliably guess the encoding of a bitstream, and that's only possible
because of the extreme strictness of the syntax. It was once thought to
be a good idea to include a Byte Order Mark at the beginning of Unicode
text streams, but the industry on the whole found this practice
distasteful, so now it's only widely used in the dusty north-west corner
of the industry.
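The BOM convention can be sniffed in a few lines; this sketch covers only the common marks (UTF-32 BOMs omitted):

```java
public class BomSniffer {
    // return the charset implied by a leading byte order mark, or null if none
    static String charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
        return null;  // no BOM: fall back to a default or other heuristics
    }

    public static void main(String[] args) {
        byte[] utf8Bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'x' };
        System.out.println(charsetFromBom(utf8Bom));  // UTF-8
    }
}
```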
> I don't think it'd be unreasonable
> since it's apparent to me there are no defaults which work as desired in
> all circumstances, and it's already the case that people who process
> character data without specifying the encoding are gambling on the
> results.
>
> I'm not real keen on it becoming my life's work though, I am only
> dealing with encodings at all out of a general desire for robustness.
>
>
I'm all for robustness in the sense of being liberal in what one
accepts, but lossy decoding should never be the default. It would be a
nice *enhancement* for Scala to support "fuzzy" charset decoding but I
would hate to see it become the default behaviour.

As for all the platform-specific heuristic crap, my recommendation is
"If you don't specify an encoding, you get UTF-8. Period, end of
story. Scala cares not for your platform's unique cultural heritage."

-0xe1a

milessabin
Re: default charsets and Source.from behavior

On Wed, Apr 22, 2009 at 5:44 PM, Alex Cruise wrote:
> As for all the platform-specific heuristic crap, my recommendation is "If
> you don't specify an encoding, you get UTF-8.  Period, end of story.  Scala
> cares not for your platform's unique cultural heritage."

Except that it does care about the *Java* platform ... so it should do
exactly what Java does and what every Java application, framework and
library expects. It's also the easiest thing to do where we're reusing
Java I/O libraries.

But, yeah, in theory UTF-8 would be the right way to go.

Cheers,

Miles

ijuma
Re: default charsets and Source.from behavior

On Wed, 2009-04-22 at 17:56 +0100, Miles Sabin wrote:
> On Wed, Apr 22, 2009 at 5:44 PM, Alex Cruise wrote:
> > As for all the platform-specific heuristic crap, my recommendation is "If
> > you don't specify an encoding, you get UTF-8. Period, end of story. Scala
> > cares not for your platform's unique cultural heritage."
>
> Except that it does care about the *Java* platform ... so it should do
> exactly what Java does and what every Java application, framework and
> library expects. It's also the easiest thing to do where we're reusing
> Java I/O libraries.

Just yesterday someone said on IRC that he felt Java I/O was OK once he
forgot the existence of the InputStreamReader constructor that doesn't
take an encoding. By implication, this also means forgetting about
FileReader. In my experience, the only people in Java who rely on the
defaults are the ones who have not yet been burned by them and it
doesn't take long to be burned by them if you deploy in more than one
platform.
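The habit Ismael describes amounts to always using the overloads that do take an encoding, e.g. (an illustrative helper):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ExplicitEncoding {
    // read bytes as text with an explicit charset; never the
    // InputStreamReader/FileReader overloads that silently use the default
    static String readAll(byte[] bytes, String charset) throws Exception {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(bytes), charset));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] euro = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };  // U+20AC in UTF-8
        System.out.println(readAll(euro, "UTF-8"));  // the euro sign
    }
}
```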

About the issue of silently accepting invalid data, I agree with the
people who said no. This stuff may seem OK in isolation, but it's
horrible when used in a larger system where you end up with illegal data
in a database and only find out about it months later.

Best,
Ismael

Alex Cruise
Re: default charsets and Source.from behavior

Miles Sabin wrote:
> On Wed, Apr 22, 2009 at 5:44 PM, Alex Cruise wrote:
>
>> As for all the platform-specific heuristic crap, my recommendation is "If
>> you don't specify an encoding, you get UTF-8. Period, end of story. Scala
>> cares not for your platform's unique cultural heritage."
>>
> Except that it does care about the *Java* platform ... so it should do
> exactly what Java does and what every Java application, framework and
> library expects. It's also the easiest thing to do where we're reusing
> Java I/O libraries.
>
Luckily scala.io can feel free to publish its own API that's
structurally unrelated to java.io or java.nio and doesn't make any
promises about complying with platform encoding heuristics.

Specifically, I propose that for any construction of a Source in which
the parameter is or describes a resource that's "just bytes," with no
available, trustworthy encoding metadata, and no encoding parameter is
specified, the method should *always* pass "UTF-8" as an *explicit*
encoding parameter to the underlying platform's primitive IO library.

Source.fromURL can probably safely make guesses when the response
includes a Content-Type; however it should be documented that if the
resource at the URL is not actually in its declared encoding (I've seen
this happen in production a few times) you're likely to get character
encoding exceptions down the line unless you pass some hypothetical
"fuzzy" flag (or explicitly ask for ISO8859-1 and take the UTF-8 garbage
like 2/3 of the PHP apps in the world).
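Extracting the declared charset from a Content-Type header is mechanical (a hypothetical helper, not anything in scala.io today):

```java
public class ContentTypeCharset {
    // pull the charset parameter out of a Content-Type header value,
    // returning null when none is declared
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase().startsWith("charset="))
                return p.substring("charset=".length()).replace("\"", "");
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));  // UTF-8
        System.out.println(charsetOf("application/octet-stream"));  // null
    }
}
```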

-0xe1a

milessabin
Re: default charsets and Source.from behavior

On Wed, Apr 22, 2009 at 6:18 PM, Alex Cruise wrote:
> Specifically, I propose that for any construction of a Source in which the
> parameter is or describes a resource that's "just bytes," with no available,
> trustworthy encoding metadata, and no encoding parameter is specified, the
> method should *always* pass "UTF-8" as an *explicit* encoding parameter to
> the underlying platform's primitive IO library.

I disagree. No default at all would be preferable to defaulting to
UTF-8. Using Java's defaults at least has the merit of being wrong in
the same way as everybody else.

> Source.fromURL can probably safely make guesses when the response includes a
> Content-Type;

Response? This is an arbitrary stream ... you're surely not suggesting
we assume a MIME envelope?!?

Cheers,

Miles

Alex Cruise
Re: default charsets and Source.from behavior

Miles Sabin wrote:
> On Wed, Apr 22, 2009 at 6:18 PM, Alex Cruise wrote:
>
>> Specifically, I propose that for any construction of a Source in which the
>> parameter is or describes a resource that's "just bytes," with no available,
>> trustworthy encoding metadata, and no encoding parameter is specified, the
>> method should *always* pass "UTF-8" as an *explicit* encoding parameter to
>> the underlying platform's primitive IO library.
>>
> I disagree. No default at all would be preferable to defaulting to
> UTF-8. Using Java's defaults at least has the merit of being wrong in
> the same way as everybody else.
>
Well, "whatever Java does" has the advantage of allowing us to pass the
buck, but it also means that we can't write a meaningfully tight
contract for the scala.io API.
>> Source.fromURL can probably safely make guesses when the response includes a
>> Content-Type;
>>
> Response? This is an arbitrary stream ... you're surely not suggesting
> we assume a MIME envelope?!?
>
If the connection happens to be an Http(s)URLConnection you can go
looking for headers; otherwise you use the usual encoding
default/specification/heuristic.

-0xe1a

Iulian Dragos 2
Re: default charsets and Source.from behavior

Paul Phillips wrote:
>> IMHO, the ticket does not hold, simply because you are referring to an
>> external source and you cannot force it to some encoding without first
>> consulting any metadata it may have (http headers in this respect).
>
> But any data can be processed as if it were any encoding, if you don't
> mind getting a '?' where the invalid bytes reside. The question is more
> one of whether one would usually prefer '?'s or runtime exceptions,
> especially in the case where one didn't specify an encoding (assuming we
> continue to allow that case -- but we can only remove it from scala, we
> can't remove it from java...)
>
I sympathize with both views: I agree it's nasty to crash at runtime for
a mismatch in encodings, but we're talking about scala.io.Source, not
just any data. I think the right thing to do is to crash, since most likely
(if what you're reading is REALLY a source file, hence it has a strict
syntax) you're going to crash later or generate wrong code. I know, I've
seen this happening with scala source files, and I was happy to see the
crash early on.

For arbitrary data, I think Paul's solution is nicer. Nobody is going to
get hurt if some question marks appear inside an email. Unfortunately
nobody has the time to write a proper scala.io library, so I'll refrain
from proposing having another class. Maybe have this 'fuzzyness' level
configurable (and required), as someone else suggested?

iulian

odersky
Re: default charsets and Source.from behavior

My 3 cents:

1. Yes, we should be consistent about encodings, and give the matter
some thought. Thanks to Paul for having taken the lead on this.

2. When there's some level of doubt, I'd follow the principle to do it
like Java. That's a meta-principle in the design of Scala. Unless we
really care about something, we do it like Java. That way, as Miles
says, we are at least wrong in the same way as everybody else :-)

3. Not knowing anything about encodings, I do not want to be called up
for a decision what to do. So I trust Paul to decide the right thing.

Cheers

extempore
Re: default charsets and Source.from behavior

I want to address this point directly:

On Thu, Apr 23, 2009 at 11:58:00AM +0200, martin odersky wrote:
> 2. When there's some level of doubt, I'd follow the principle to do it
> like Java. That's a meta-principle in the design of Scala. Unless we
> really care about something, we do it like Java. That way, as Miles
> says, we are at least wrong in the same way as everybody else :-)

On Wed, Apr 22, 2009 at 04:13:20PM +0100, Miles Sabin wrote:
> As far as file I/O is concerned, I really think we should revert to
> doing *exactly* the same thing as Java.

I think that the question of whether to do this is settled by something
I had forgotten to include in my already too long summary, which is that
scala has been ignoring the default file encoding for a long time, as
specified here:

https://lampsvn.epfl.ch/trac/scala/ticket/1581

So it would actually be a breaking change for scala to now suddenly
start using MacRoman when no file encoding is specified (as would be
required by "exactly the same as java") because it has been using UTF-8
no matter what for at least several releases. If not using MacRoman and
Cp1252 were posing a big problem for scala users on OS X and Windows,
we would have heard about it by now.

It's worth looking at how jruby very recently faced this same issue:

http://jira.codehaus.org/browse/JRUBY-3576

The Mac JDK uses MacRoman as its default encoding, which means that
UTF-8 strings coming from the filesystem, console, or almost anywhere
else will not print correctly. I'm baffled how this bug has remained in
for so long. Our only option may be to start using a JRuby-specific
variable for default encoding, which uses platform default on most
systems, but UTF-8 when it's MacRoman. This would probably solve all
sorts of issues dealing with strings on Mac.
[...]
I took the path of forcing file.encoding to UTF-8 when uname -s =
"Darwin". I can see no reason why Apple JDK should default to
MacRoman...it's simply wrong.

I had come to the identical conclusion, although I am not yet sure that
setting file.encoding is the best way to address it.

Combining all that with what I think is the majority opinion that in the
absence of a specified and non-platform-default file encoding we should
always use UTF-8 anyway, then that is the logic I intend to follow.

No matter what we do, all bets are going to be off for people who
directly call into java libs without specifying an encoding. So I'll
document that issue, and focus on making the scala I/O routines
internally consistent with sane defaults.
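The "use the platform default, but ignore MacRoman on OS X" rule described above can be sketched as a small pure function. This is only an illustration of the proposed logic; `EncodingDefaults` and `preferredEncoding` are invented names, not the actual compiler code.

```scala
// Sketch of the proposed default-encoding rule: trust the platform's
// file.encoding, except that Apple's JDK misreports MacRoman on OS X,
// where the real system default is UTF-8.
object EncodingDefaults {
  def preferredEncoding(sysEncoding: Option[String], osName: String): String =
    sysEncoding match {
      case Some("MacRoman") if osName.startsWith("Mac") => "UTF-8" // Apple workaround
      case Some(enc)                                    => enc     // honor the platform
      case None                                         => "UTF-8" // no info: pick UTF-8
    }
}
```

In real code the two inputs would come from System.getProperty("file.encoding") and System.getProperty("os.name"); keeping them as parameters makes the decision testable in isolation.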

I remain interested in all input as this is a tough place for me to
anticipate everything.

On Thu, Apr 23, 2009 at 11:58:00AM +0200, martin odersky wrote:
> 3. Not knowing anything about encodings, I do not want to be called up
> for a decision what to do. So I trust Paul to decide the right thing.

Brilliantly played! You've got me studying a subject that until recently
held my interest only slightly less well than my office paint drying.

odersky
Re: default charsets and Source.from behavior

Hi Paul,

Now that I have devolved things to you, you drag me back in :-)

I am not sure that keeping to the status quo is a good idea. First,
ticket 1581 is not that long ago. Second, most people avoid encodings
by sticking to 7 bit ASCII. Third, we can probably live with some
incompatibility for 2.8 -- it won't be the only one.

So my advice would still be: If you do not want the issue to come back
again and again, keep a low profile, and do what everybody else is
doing (i.e. do like Java).

Cheers

extempore
Re: default charsets and Source.from behavior

On Fri, Apr 24, 2009 at 08:07:19PM +0200, martin odersky wrote:
> I am not sure that keeping to the status quo is a good idea. First,
> ticket 1581 is not that long ago.

The ticket is not, but the behavior is. Here is the timeline:

https://lampsvn.epfl.ch/trac/scala/changeset/11012
05/11/07 15:16:31 (2 years ago)

That is when the bug which underlies #1581 was introduced. However, at
that point instead of always using UTF-8 no matter what, it always used
ISO-8859-1 no matter what.

https://lampsvn.epfl.ch/trac/scala/changeset/13888
02/05/08 11:27:38 (15 months ago)

That changed the default to UTF-8, without touching the underlying bug
(which indeed is still there.)

I should point out right now, because only in the course of trying to
understand all this have the many ways and places encodings are used
begun to distinguish themselves to me, that until r13888 the bug
described by #1581 applied only to the encoding being used by scalac to
read source code files -- the one which you can specify to scala (and
java) with the "-encoding" switch.

Distinctly from this, and the more interesting question with respect
to compatibility, is what encoding to use on methods such as
Source.fromFile if the user specifies no encoding. And in the next
commit, the incorrectly determined util.Properties.encodingString
expanded its reach into Source.fromFile.

https://lampsvn.epfl.ch/trac/scala/changeset/13889
02/05/08 11:39:14 (15 months ago)

  def fromFile(file: File): Source =
    fromFile(file, util.Properties.encodingString, Source.DefaultBufSize)

So the default encoding for all files opened via scala.io has been UTF-8
for 15 months. This is why I say it would be more disruptive to do what
java does, than it would be to continue as things are.

To further muddy the waters, Source.fromInputStream has been hardcoded
to use utf-8 since at least r11012.

  def fromInputStream(is: InputStream): Source =
    fromInputStream(is, "utf-8", None)

It is still hardcoded that way in trunk.
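The inconsistency is easiest to see when both entry points funnel through one shared value instead of one reading the (buggy) properties lookup and the other hardcoding "utf-8". A minimal sketch, where `MiniSource` is a stand-in for scala.io.Source and not the real class:

```scala
import java.io.InputStream

// Stand-in for scala.io.Source showing one shared default encoding
// instead of two divergent ones.
object MiniSource {
  val defaultEncoding = "UTF-8"

  // A default parameter keeps every overload in agreement unless the
  // caller explicitly overrides the encoding.
  def fromInputStream(is: InputStream, enc: String = defaultEncoding): String = {
    val bytes =
      Iterator.continually(is.read()).takeWhile(_ != -1).map(_.toByte).toArray
    new String(bytes, enc)
  }
}
```

Default parameters are themselves a 2.8 feature, which makes this cleanup a natural fit for the release under discussion.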

> Second, most people avoid encodings by sticking to 7 bit ASCII. Third,
> we can probably live with some incompatibility for 2.8 -- it won't be
> the only one.

It is true that the reason all this hasn't been more of a problem is
that most people most of the time are flying under the 8-bit radar and
all the encodings treat 0-127 the same. However, to me that sounds like
a strong argument to go UTF-8 all the way. If there's one thing we can
conclude from the paucity of bug reports about encodings, it's that
these issues affect a relatively small minority. I think that minority
would be far better served by attempting to improve upon the java
situation and to treat the platforms consistently than by continuing to
propagate java's design deficiencies.

> So my advice would still be: If you do not want the issue to come back
> again and again, keep a low profile, and do what everybody else is
> doing (i.e. do like Java).

Not having the issue come back again and again is definitely my goal! I
just think that the last 15 months of scala usage is at least as good a
source to draw upon as is what typical java apps are doing.

odersky
Re: default charsets and Source.from behavior

I'm still not convinced. The fact that we could change the behavior
globally 15 months ago without a user revolt indicates to me that code
which relies on encodings simply was not that common. And Scala io is
really rudimentary and not all that useful.
Looking at the collection libraries and now also Enumeration, I am a
bit shocked how much inconsistent code has accumulated there and I
guess scala.io is no different. That's something we need to do now:
Write useful libraries with a high standard of code quality. So,
again, I would not let compatibility constraints prevent us from doing
the right thing.

So what is the right thing? I agree that utf8 is a nice, universal,
modern standard. But there are really powerful arguments to do the
same as Java here. We are talking about I/O! So files are bound to be
written by one program and read by another. It would be annoying if
Scala and Java used different conventions which then hampered
interoperability.

In the end the question is: Do we want to make it easy to move files
between Java (or other System programs) and Scala, or do we want to
make it easy to move files between Scala programs running on different
systems? My vote is on the former.

Cheers

milessabin
Re: default charsets and Source.from behavior

On Sat, Apr 25, 2009 at 9:25 AM, martin odersky wrote:
> In the end the question is: Do we want to make it easy to move files
> between Java (or other System programs) and Scala, or do we want to
> make it easy to move files between Scala programs running on different
> systems? My vote is on the former.

Agreed 200%.

Please note that the idea of defaults here really doesn't make much
sense ... any code which relies on them will be subtly (or not so
subtly) broken in some contexts. So the right thing to do is for users
of the API to specify an encoding explicitly.

So for me this issue boils down to ...

1. Should we support a default encoding at all?

2. If we support a default encoding, should it be consistent with Java?

I think that the answer to (2) is clearly "Yes".

(1) is not so clear, however. Bearing in mind that any reliance on
default encodings is essentially a programming error I have a lot of
sympathy with Ismael's view that we should drop defaults altogether.
On the other hand we really need to be able to support "same as Java"
behaviour for interop.

So one option would be drop defaults, but add an explicit "same as
Java" encoding which could be specified explicitly where people really
want that behaviour.

But that might be a design decision which is better left deferred until
we actually start work on the design of a sensible scala.io package.

Cheers,

Miles

Erik Engbrecht
Re: default charsets and Source.from behavior
> 1. Should we support a default encoding at all?

Yes. Character encodings are one of those things that people tend to not
think about until they get burned by it, and even then they are likely
to avoid thinking about it because they really don't understand it.
Making someone explicitly set something in the name of correctness when
they don't understand what they are setting is rather pointless.

On Mac I think it would be reasonable to detect the default character
set being MacRoman and flip it to UTF-8. I'm not sure about on Windows.
Other than that I think it's best to follow what Java does.

extempore
Re: default charsets and Source.from behavior

OK, I'm out of arguments. We'll do it like java.

My fallback position was to suggest no defaults for scala, and indeed
miles brought that possibility up.

I haven't looked too closely at the named/default arguments for 2.8.
Are default implicits possible? That's the default I've always wanted.
Can we do something clever with charsets passed as implicit values?
Charsets do fall squarely in the category of parameters I would view as
implicit-appropriate.
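The "charsets as implicit values" idea can be sketched like this. `Codec` and `Io` are hypothetical names for illustration (scala.io did eventually grow a Codec type in this spirit, but this is not its actual API):

```scala
// Hypothetical sketch: the charset travels as an implicit value, so
// callers get a sane default from the companion object but can override
// it explicitly, or by bringing their own implicit into scope.
case class Codec(name: String)

object Codec {
  // Found automatically via the implicit scope of the Codec type.
  implicit val default: Codec = Codec("UTF-8")
}

object Io {
  def decode(bytes: Array[Byte])(implicit codec: Codec): String =
    new String(bytes, codec.name)
}
```

With this shape, Io.decode(bytes) silently uses UTF-8, Io.decode(bytes)(Codec("ISO-8859-1")) overrides one call site, and an implicit val in scope overrides a whole block -- which is exactly the "default I've always wanted" behavior.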

In the same vein as "easy things easy and hard things possible", I
personally favor making "good engineering easy and legacy interoperation
possible", and I think we might be going the other way on this one. But
there are good arguments on all sides, that I don't dispute.

On Sat, Apr 25, 2009 at 08:37:47AM -0400, Erik Engbrecht wrote:
> On Mac I think it would be reasonable to detect the default character
> set being MacRoman and flip it to UTF-8.

This is the one move I'm going to realllly miss if we're determined to
do things just like java. Everyone and their brother knows java is
wrong here. UTF-8 has been the default charset on the mac since OS X
came out, so we would be defaulting to a long-unused charset to pursue
interoperation with unspecified broken java software, at the expense of
interoperating with all current native applications.

The irony is thick, because the only reason we're not using UTF-8
everywhere is to do it like the natives do.

Carl-Eric Menzel
Re: default charsets and Source.from behavior

On Sat, 25 Apr 2009 10:25:15 +0200
martin odersky wrote:
> In the end the question is: Do we want to make it easy to move files
> between Java (or other System programs) and Scala, or do we want to
> make it easy to move files between Scala programs running on different
> systems? My vote is on the former.

Hi, I'm new to the list (also to Scala, and I really like it so far) but
I'd like to add my non-binding voice to this too: Having been bitten
often enough by charset issues, I think Java is full of inconsistencies
anyway. Strings are Unicode, but Properties files are 8859-1. I've seen
app servers and servlet containers defaulting to either UTF-8 or
8859-1, or to whatever the system locale is set.

Thus I think that once you have to interface with Java or any other
legacy system, you are screwed anyway and *have* to check your
encoding. Relying on defaults here is going to bite you. It would be
nice to have Scala default to a sane encoding. In my opinion, that
would be UTF-8, since it's the only one (apart from the other UTF-*)
where I can mix different alphabets and get away with it.

Just my €0.02.
Carl-Eric

ijuma
Re: default charsets and Source.from behavior

On Sat, 2009-04-25 at 08:37 -0400, Erik Engbrecht wrote:
> 1. Should we support a default encoding at all?
>
>
> Yes. Character encodings are one of those things that people tend to
> not think about until they get burned by it, and even then they are
> likely to avoid thinking about it because they really don't understand
> it. Making someone explicitly set something in the name of
> correctness when they don't understand what they are setting is rather
> pointless.

No default is better than a bad default. For example, a while ago I used
java.io.{FileReader,FileWriter} to read and write a file for a
cross-platform application. I didn't think much of it since one can't
even pass the encoding to FileReader and FileWriter. A few months later,
some bug reports started showing up for users that used the application
in multiple platforms (one of them Mac OS X).

Obviously, the problem was that different encodings were being used in
different platforms. To make matters more interesting, the code that was
added to try and recover from this situation actually had some issues
too. Fun.

All of that could have been avoided with a good cross-platform default
_or_ no default. Hiding the encoding under the carpet with a bad default
is just not a good idea.

Best,
Ismael
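Ismael's pitfall is that FileReader and FileWriter take no encoding parameter at all, so they silently use the platform default. The standard fix is the stream/reader pair with an explicit charset; `ExplicitIo` below is an illustrative wrapper, not a proposed API.

```scala
import java.io._

// Illustrative wrapper: pin the encoding explicitly instead of relying
// on FileReader/FileWriter, which always use the platform default.
object ExplicitIo {
  def write(f: File, text: String): Unit = {
    val w = new OutputStreamWriter(new FileOutputStream(f), "UTF-8")
    try w.write(text) finally w.close()
  }

  def read(f: File): String = {
    val r = new InputStreamReader(new FileInputStream(f), "UTF-8")
    try {
      val sb = new StringBuilder
      var c = r.read()
      while (c != -1) { sb.append(c.toChar); c = r.read() }
      sb.toString
    } finally r.close()
  }
}
```

The same bytes now round-trip identically on OS X, Windows, and Linux, which is precisely what the FileReader version could not guarantee.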

loverdos
Re: default charsets and Source.from behavior
The case with Properties is just ridiculous (native2ascii anyone???). I am pretty confident that anyone out there working with properties under a multilingual environment has been either using a modified java.util.Properties source or another proper implementation (some ASF project has it, I cannot recall right now). 

loverdos
Re: default charsets and Source.from behavior
From a design perspective, I would like to put forth a few ideas. Some
can be mixed together.

1. Do not use a plain String to represent an encoding. Use a typeful
alternative. For example, there is java.nio.charset.Charset. But in the
past I have also employed custom enums (in Java).

2. Do not use an encoding/charset (typeful or not) directly. Use either
a factory or (in Scala) a named parameter. Even if you have a default
factory (say, PlatformFactory), you can always change its
_implementation_ in the future without recompiling all sources, but of
course you are changing semantics...

3. See if implicits can be used (Paul has already proposed this)
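Idea 1 might look like the following in Scala; all names here are invented for illustration. The payoff is that an invalid charset name fails eagerly at construction time rather than at the first read.

```scala
import java.nio.charset.Charset

// Typeful encodings: common charsets are values, arbitrary ones are
// validated eagerly by Charset.forName (which throws on unknown names).
sealed abstract class Encoding(val charset: Charset)
case object Utf8   extends Encoding(Charset.forName("UTF-8"))
case object Latin1 extends Encoding(Charset.forName("ISO-8859-1"))
final case class Other(name: String) extends Encoding(Charset.forName(name))
```

A method taking an `Encoding` rather than a String then cannot be handed a typo like "UFT-8" that only blows up in production.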


Erik Engbrecht
Re: default charsets and Source.from behavior
But what's a bad default?

I think we can agree that MacRoman is a bad default.

So on Mac, when Scala is launched I think it would be reasonable to:

1. Check if -Dfile.encoding=X is set in JAVA_OPTS.
2. If it is set, then follow it. The user said that's the default
encoding. The user should be obeyed.
3. If it's not set, look in some other useful place. For example, my Mac
seems to have the environment variable LANG=en_US.UTF-8. I think it
would be entirely reasonable to use the UTF-8 part, or whatever else I
may set it to.
4. If there isn't a reasonable value defined in the environment
somewhere to use, set -Dfile.encoding=UTF-8 -- this is assuming Scala
knows it is on a JVM with a bad default encoding, like OS X.

What I don't think Scala should do is:

1. Ignore the environment and just pull something out of a config file
that's part of the Scala distribution.
2. Hardcode the default character encoding somewhere in the source.
3. Use a different mechanism for representing the default encoding than
Java.

#3 is important because I should be able to mix calls to Scala IO
libraries and Java IO libraries and have consistent behavior. Java
picking MacRoman and Scala picking UTF-8 would be more broken than just
using MacRoman.
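Erik's lookup order can be sketched as a pure function over explicit inputs. `EncodingResolver` is an illustrative name; a real implementation would feed it System.getProperty("file.encoding") and System.getenv("LANG").

```scala
// Sketch of the proposed resolution order on a JVM with a bad default:
// 1. an explicit -Dfile.encoding wins; 2. else the codeset suffix of
// LANG (e.g. "en_US.UTF-8" -> "UTF-8"); 3. else fall back to UTF-8.
object EncodingResolver {
  def resolveEncoding(fileEncoding: Option[String], lang: Option[String]): String =
    fileEncoding
      .orElse(lang.flatMap { l =>
        l.split('.') match {
          case Array(_, codeset) => Some(codeset) // "en_US.UTF-8" -> "UTF-8"
          case _                 => None          // no codeset suffix
        }
      })
      .getOrElse("UTF-8")
}
```

Because the function never consults global state, each rung of the fallback ladder can be unit-tested directly.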

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland