Re: Support for Ropes in Scala]

Tue, 2010-12-21, 17:17

#1

soc

Joined: 2010-02-07,

Re: Unicode support in Scala Was: Support for Ropes in Scala

Hi!

Interesting! This seems to mirror some of the things people plan to do
in Perl 6.

But I think Python is generally not a good example.

They still haven't figured out how "big" a "char" should be.
So some methods (like length) are not only wrong most of the time you
use anything outside the "common" range,
they even differ between platforms and versions.

Bye,

Simon

Tue, 2010-12-21, 22:27

#2

arya

Joined: 2010-02-11,

Re: Support for Ropes in Scala]

Wouldn't it be as simple as switching type String = scala.JimString and adding one or two implicit defs in Predef? Then only those with "import java.lang.String" in their code, of which I suspect there are very few, since it's automatically imported, would be affected.
-Arya

On Tue, Dec 21, 2010 at 10:59 AM, Erik Osheim <erik@plastic-idolatry.com> wrote:

I failed to send this to the list yesterday (instead replying only to
Duarhs) and figured I'd pass it along. It's worth noting that other
languages do have better solutions to this problem (for instance
Python 3 [1]).

[1] http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

Tue, 2010-12-21, 23:07

#3

Tony Morris 2

Joined: 2009-03-20,

Re: Support for Ropes in Scala]

On 22/12/10 01:59, Erik Osheim wrote:
> Thus, you could build a new class (JimString) that represents unicode
> strings as an array of unicode glyphs (integers), but you wouldn't
> actually be able to *use* it with anyone else's libraries without some
> magic from within Scala. And without even more magic you wouldn't be
> able to use them with Java classes expecting to receive instances of
> java.lang.String.
FYI, GHC has a way around this.
http://www.haskell.org/ghc/docs/6.12.2/html/users_guide/type-class-extensions.html#overloaded-strings

--
Tony Morris
http://tmorris.net/

Tue, 2010-12-21, 23:27

#4

Jon Pretty 2

Joined: 2010-01-11,

Re: Support for Ropes in Scala]

Hi Arya,

Arya Irani wrote:
> Wouldn't it be as simple as switching type String = scala.JimString and
> adding one or two implicit defs in Predef?
> Then only those with "import java.lang.String" in their code, of which I
> suspect there are very few, since it's automatically imported, would be
> affected.

Unfortunately it's not that simple. Given that there are going to be lots of
java.lang.Strings floating around from any existing libraries, implicits will only come into
play in expressions which wouldn't otherwise typecheck. So while a call to

myJavaString.replaceAll("[a-z]*", "")

could be persuaded to invoke an implicit conversion (due to string literals being JimStrings
in our hypothetical world and replaceAll(JimString, JimString) not existing on
java.lang.Strings), something like

myJavaString == myJimString

will use the == method defined on myJavaString because its parameter always takes Any, i.e.
it typechecks already.
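
For illustration, a minimal sketch of the point above. JimString here is a hypothetical stand-in (a case class wrapping code points), not an existing Scala type, and the conversion name javaToJim is likewise made up. The comparison is written with equals, which is what == delegates to for non-null references, so the call shape is the same as in the example above.

```scala
import scala.language.implicitConversions

object EqualsDemo {
  // Hypothetical stand-in for the thread's JimString: a sequence of code points.
  final case class JimString(codePoints: Vector[Int])

  // Hypothetical implicit conversion of the kind proposed for Predef.
  implicit def javaToJim(s: String): JimString =
    JimString(s.codePoints().toArray().toVector)

  val myJavaString: String = "abc"

  // A String is not a JimString, so this line does not typecheck on its own
  // and the implicit conversion is applied.
  val myJimString: JimString = "abc"

  // equals accepts any reference, so this already typechecks: no conversion
  // is inserted and java.lang.String.equals returns false.
  val withoutConversion: Boolean = myJavaString.equals(myJimString)

  // Only an explicit type ascription forces the conversion.
  val withConversion: Boolean = (myJavaString: JimString).equals(myJimString)
}
```

Under these assumptions the two comparisons disagree even though both operands hold the same text, which is the silent failure being described.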

Sorry.

Cheers,
Jon

Tue, 2010-12-21, 23:57

#5

arya

Joined: 2010-02-11,

Re: Support for Ropes in Scala]

Hmm, ok, good point. Where did myJavaString come from though?
// java.lang.String
val broken = SomeLibraryWhichMightBeJava.someCallWhichIDontRealizeWillProduceABrokenString
// implicit conversion, but way too verbose!
val ok: JimString = SomeLibraryWhichMightBeJava.someCallWhichIDontRealizeWillProduceABrokenString
Ok, so... you'd need every java.lang.String to be autoboxed into a scala.JimString, and excepted (or unboxed) where java.lang.String is required. Thoughts?
I've never written a compiler plugin...
-Arya

Wed, 2010-12-22, 00:27

#6

Jon Pretty 2

Joined: 2010-01-11,

Re: Support for Ropes in Scala]

Arya Irani wrote:
> Hmm, ok, good point. Where did myJavaString come from though?

It could have come from /so many/ different places... ;-)

> Ok, so... you'd need every java.lang.String to be autoboxed into a
> scala.JimString, and except (or unboxed) where java.lang.String is
> required. Thoughts?

You could do that. But I'm playing devil's advocate, and I think the problem is deeper than
that. Consider the following:

val js : JimString = "Cafébabe" // let's pretend é takes up two code points

org.random.javaproject.StringTools.insert(js, " ", js.indexOf("babe"))

I would like the insert method to give me the string "Café babe", but no matter how correct
the implementation of indexOf for JimString is, the implicit conversion will only convert
the JimString to a String; it won't convert the 4 to a 5.
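
A small sketch of that mismatch, using a real supplementary character (U+1D41E, which takes two 16-bit chars in a java.lang.String) in place of the pretend two-unit é. The object and method names are hypothetical; codePointIndex stands in for what a code-point-aware JimString.indexOf might return, and insertAtChar for a Java-style insert that counts in chars.

```scala
object IndexMismatch {
  // U+1D41E lies outside the BMP, so it is stored as a surrogate pair.
  val boldE: String = new String(Character.toChars(0x1D41E))
  val s: String = "Caf" + boldE + "babe"

  // Index as java.lang.String reports it: counted in 16-bit chars.
  val charIndex: Int = s.indexOf("babe")

  // The same position counted in code points, as a code-point-aware
  // string type would report it.
  val codePointIndex: Int = s.codePointCount(0, charIndex)

  // An insert that interprets its index in chars, like typical Java code.
  def insertAtChar(str: String, what: String, at: Int): String =
    str.substring(0, at) + what + str.substring(at)
}
```

Passing charIndex to insertAtChar puts the space before "babe"; passing codePointIndex puts it between the two halves of the surrogate pair and yields a malformed string.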

And ironically, the whole thing would have worked fine if we had never invented JimString...

I think the moral of the story is that it's too late to fix Java Strings.

That's not to say there isn't a need for a better Unicode library, but it shouldn't be
seamless (seams make subtle boundaries clear) or the default, and a prerequisite of using it
should be a good understanding of the issues surrounding Java Strings.

Jon

Wed, 2010-12-22, 00:57

#7

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Tue, Dec 21, 2010 at 2:17 PM, Jon Pretty <_@scala.propensive.com> wrote:
> Hi Arya,
>
> Arya Irani wrote:
>> Wouldn't it be as simple as switching type String = scala.JimString and
>> adding one or two implicit defs in Predef?
>> Then only those with "import java.lang.String" in their code, of which I
>> suspect there are very few, since it's automatically imported, would be
>> affected.
>
> Unfortunately it's not that simple. Given that there are going to be lots of
> java.lang.Strings floating around from any existing libraries, implicits will only come into
> play in expressions which wouldn't otherwise typecheck. So while a call to
>
> myJavaString.replaceAll("[a-z]*", "")
>
> could be persuaded to invoke an implicit conversion (due to string literals being JimStrings
> in our hypothetical world and replaceAll(JimString, JimString) not existing on
> java.lang.Strings), something like
>
> myJavaString == myJimString
>
> will use the == method defined on myJavaString because its parameter always takes Any, i.e.
> it typechecks already.

That certainly is a problem. But note that there is no
java.lang.String#== ... the compiler translates `==` into an
invocation of the equals method, and it (or a plugin) could do
something different for java.lang.String (which is possible since it's
final).

Another approach is to stick with java.lang.String as the default
Scala string -- a bad choice for memory or speed optimization but
sensible for Java interop -- but change StringLike and StringOps (or
provide an alternative version) to be defined correctly. The fact is
that UTF16 strings are *not* indexable; to get an indexable string,
you need to convert it to UTF32 or Array[Int] (or Array[UnicodeChar]).
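
As a rough sketch of that conversion, using only methods that exist on java.lang.String (Java 8 or later for codePoints); the object and method names are hypothetical:

```scala
object CodePointArray {
  // Decode a UTF-16 string into one Int per Unicode code point.
  def toCodePoints(s: String): Array[Int] =
    s.codePoints().toArray()

  // Encode an array of code points back into a java.lang.String.
  def fromCodePoints(cps: Array[Int]): String =
    new String(cps, 0, cps.length)
}
```

Indexing into the resulting Array[Int] addresses whole characters, whereas charAt on the original string addresses 16-bit units. The conversion costs a full pass over the string and four bytes per code point.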

It's unfortunate that the designers of Scala, who made so many
brilliant decisions, overlooked Java's huge gaping design error that
was well known at the time of Scala's design. Java has since been
"fixed" to provide methods to deal with full Unicode -- as long as one
assiduously avoids the char type and avoids indexing or slicing
Strings. If Scala does not provide full support of Unicode, and in a
way consistent with its functional programming features, I fear that
this will become a large impediment to its acceptance and success.

Wed, 2010-12-22, 01:17

#8

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Tue, Dec 21, 2010 at 3:22 PM, Jon Pretty <_@scala.propensive.com> wrote:
> Arya Irani wrote:
>> Hmm, ok, good point. Where did myJavaString come from though?
>
> It could have come from /so many/ different places... ;-)
>
>> Ok, so... you'd need every java.lang.String to be autoboxed into a
>> scala.JimString, and except (or unboxed) where java.lang.String is
>> required. Thoughts?
>
> You could do that. But I'm playing devil's advocate, and I think the problem is deeper than
> that. Consider the following:
>
> val js : JimString = "Cafébabe" // let's pretend é takes up two code points
>
> org.random.javaproject.StringTools.insert(js, " ", js.indexOf("babe"))
>
> I would like the insert method to give me the string "Café babe", but no matter how correct
> the implementation of indexOf for JimString is, the implicit conversion will only convert
> the JimString to a String; it won't convert the 4 to a 5.

But but but ... 4 is the *correct* value; if
org.random.javaproject.StringTools.insert needs a 5 there to do the
right thing, then it is broken -- it doesn't handle full Unicode. If
that insert method is coded correctly, then it does not do naive
indexing or slicing operations ... it traverses the string one
character at a time and inserts the space after the 4th character --
*not* the 4th 16-bit "char" -- please read the java.lang.Character API
that I posted previously. Keep in mind that java.lang.String itself is
not broken, it's a valid Unicode representation (UTF-16), so
converting another valid Unicode representation, whether js is UTF-8
or UTF-16 or UTF-32, to java.lang.String is not a problem.
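
One way such an insert might look, sketched with String.offsetByCodePoints (added in Java 1.5) rather than an explicit traversal. The object name and signature are hypothetical and are not the StringTools.insert from the example:

```scala
object CodePointInsert {
  // Insert `what` after the first `codePointIndex` characters of `s`,
  // where characters are counted as code points, not 16-bit chars.
  def insert(s: String, what: String, codePointIndex: Int): String = {
    // Translate the code point position into a char offset; this walks
    // the string and steps over surrogate pairs as single characters.
    val charOffset = s.offsetByCodePoints(0, codePointIndex)
    s.substring(0, charOffset) + what + s.substring(charOffset)
  }
}
```

With this version, index 4 gives the intended result both for "Cafebabe" and for a string whose fourth character is a supplementary one.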

> And ironically, the whole thing would have worked fine if we had never invented JimString...

Well, no, it would not have "worked fine", because being broken for
Unicode supplemental characters is not "fine".

> I think the moral of the story is that it's too late to fix Java Strings.

It isn't necessary to fix Java Strings -- it's String *handling*, and
the Java char and Scala Char types, that are broken. *Optionally* one
could change the default String type to be UTF-8 or UTF-32, because
UTF-16 is suboptimal for both space and speed (unless one brokenly
treats UTF-16 as an indexable array), but it isn't necessary for
correctness.

> That's not to say there isn't a need for a better Unicode library, but it shouldn't be
> seamless (seams make subtle boundaries clear) or the default,

The current default is *broken* and results in all Scala code that
operates on Strings as a collection of characters being broken. At the
very least we need a transition path from that broken default to
something that isn't broken.

> and a prerequisite of using it
> should be a good understanding of the issues surrounding Java Strings.

Yes, indeed, but that's something that some of the discussants here
clearly lack.

Wed, 2010-12-22, 01:27

#9

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

P.S. A moral that can be drawn from Jon's example is that, as more and
more Java code is converted to handle Strings properly by using the
codePoint methods that were added to java.lang.Character in 1.5,
interoperability between Scala and Java will break down. Given a
*properly* coded Java insert method and Scala's current indexOf method
on Strings, or v.v., you will get the wrong result for Strings that
contain supplemental characters. Strings are not arrays of fixed-size
chars, they are sequences of variable-size characters, and commercial
Java programmers now know that and have the tools to program them
properly -- but laboriously. But the only tools Scala programmers have
are those awful java.lang.Character methods; natural Scala approaches
give the wrong results.
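
A short sketch of the kind of thing meant here, assuming a string with one supplementary character (U+1D41E). The names are hypothetical; takeCodePoints is one possible code-point-aware counterpart to take, built on the java.lang.String methods mentioned above.

```scala
object CharCollectionPitfalls {
  val boldE: String = new String(Character.toChars(0x1D41E))
  val s: String = "Caf" + boldE + "babe"

  // length counts 16-bit chars; codePointCount counts characters.
  val charLength: Int = s.length
  val codePointLength: Int = s.codePointCount(0, s.length)

  // The collection-style take works on chars, so it can cut a
  // surrogate pair in half and leave a lone high surrogate at the end.
  val truncated: String = s.take(4)

  // Code-point-aware variant; throws IndexOutOfBoundsException if n
  // exceeds the number of code points in str.
  def takeCodePoints(str: String, n: Int): String =
    str.substring(0, str.offsetByCodePoints(0, n))
}
```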


Wed, 2010-12-22, 01:47

#10

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Tue, Dec 21, 2010 at 4:08 PM, Jim Balter wrote:
> On Tue, Dec 21, 2010 at 3:22 PM, Jon Pretty <_@scala.propensive.com> wrote:

>> And ironically, the whole thing would have worked fine if we had never invented JimString...
>
> Well, no, it would not have "worked fine", because being broken for
> Unicode supplemental characters is not "fine".

Let me clarify that:

If the string does not contain supplemental characters, then it works
fine in all cases. If it does contain supplemental characters, then

1) If insert is not aware of supplemental characters but indexOf is,
it doesn't work.
2) If neither insert nor indexOf is aware of supplemental characters,
then it does work.

You are only comparing those two cases and thus deriving "irony". But:

3) If insert is aware of supplemental characters but indexOf is not,
it doesn't work.
4) If both insert and indexOf are aware of supplemental characters,
then it does work.
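
The four cases can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, the "unaware" variants count in 16-bit chars, the "aware" variants count in code points, and the test string contains one supplementary character (U+1D41E).

```scala
object FourCases {
  val boldE: String = new String(Character.toChars(0x1D41E))
  val s: String = "Caf" + boldE + "babe"
  val expected: String = "Caf" + boldE + " babe"

  // Unaware indexOf: position in 16-bit chars.
  def indexOfChars(str: String, sub: String): Int = str.indexOf(sub)

  // Aware indexOf: position in code points (-1 if not found).
  def indexOfCodePoints(str: String, sub: String): Int = {
    val i = str.indexOf(sub)
    if (i < 0) i else str.codePointCount(0, i)
  }

  // Unaware insert: index interpreted in 16-bit chars.
  def insertChars(str: String, what: String, at: Int): String =
    str.substring(0, at) + what + str.substring(at)

  // Aware insert: index interpreted in code points.
  def insertCodePoints(str: String, what: String, at: Int): String =
    insertChars(str, what, str.offsetByCodePoints(0, at))

  // Does the combination produce the expected string?
  def works(insertAware: Boolean, indexOfAware: Boolean): Boolean = {
    val idx =
      if (indexOfAware) indexOfCodePoints(s, "babe") else indexOfChars(s, "babe")
    val out =
      if (insertAware) insertCodePoints(s, " ", idx) else insertChars(s, " ", idx)
    out == expected
  }
}
```

Here works returns true only when both arguments agree, matching cases 2 and 4; the mixed combinations of cases 1 and 3 put the space in the wrong place.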

You can't just assume that all Java code is and will forever be broken
as an argument against fixing string handling in Scala. Java code is
being and will continue to be fixed, and counting on things happening
to work with broken code is no help in reality as opposed to isolated
examples. If you want your code to work with full Unicode, you will
have to carefully check that all methods that you call work with full
Unicode, whatever the language.

Wed, 2010-12-22, 01:57

#11

Jon Pretty 2

Joined: 2010-01-11,

Re: Support for Ropes in Scala]

Hi Jim,

Jim Balter wrote:
> But but but ... 4 is the *correct* value; if
> org.random.javaproject.StringTools.insert needs a 5 there to do the
> right thing, then it is broken -- it doesn't handle full Unicode.

Correct. I'm assuming it's been written to be consistent with Java's indexing.

> If
> that insert method is coded correctly, then it does not do naive
> indexing or slicing operations ... it traverses the string one
> character at a time and inserts the space after the 4th character --
> *not* the 4th 16-bit "char" -- please read the java.lang.Character API
> that I posted previously. Keep in mind that java.lang.String itself is
> not broken, it's a valid Unicode representation (UTF-16), so
> converting another valid Unicode representation, whether js is UTF-8
> or UTF-16 or UTF-32, to java.lang.String is not a problem.

The point was that there's a lot of code out there which is broken. But it's broken in a
fairly consistent way.

> Well, no, it would not have "worked fine", because being broken for
> Unicode supplemental characters is not "fine".

Sorry, I meant that the example alone would have worked fine. My point was that a lot of
the brokenness cancels itself out, which is jolly nice and convenient most of the time.

> The current default is *broken* and results in all Scala code that
> operates on Strings as a collection of characters being broken. At the
> very least we need a transition path from that broken default to
> something that isn't broken.

I could be perfectly satisfied that my code weren't broken if I'm only accessing or reading
data in from encodings which I know to be subsets of the BMP. Often we can know this.

> Yes, indeed, but that's something that some of the discussants here
> clearly lack.

Myself included. But the reason it's so rarely a problem is that the characters which cause
problems are so rarely used. For a glimpse of what Java won't handle correctly, have a look
at what's in the supplementary planes:

http://en.wikipedia.org/wiki/Plane_(Unicode)

To summarize flippantly, it's stuff people don't use. I don't want to labour the point,
because I understand there is a problem. I'm just acknowledging that most people, most
programmers, won't ever experience it. How often do you see the telltale signs of UTF-8
being interpreted as Latin1 or ASCII? This is a million times less common.

BUT, having said that, an extremely strong motivator for fixing broken code is that it poses
potential security hazards. The sad thing is that mixing broken and unbroken code creates
more risk in this respect during the transition.

Cheers,
Jon

Wed, 2010-12-22, 02:37

#12

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Tue, Dec 21, 2010 at 4:55 PM, Jon Pretty <_@scala.propensive.com> wrote:
> Hi Jim,
>
> Jim Balter wrote:
>> But but but ... 4 is the *correct* value; if
>> org.random.javaproject.StringTools.insert needs a 5 there to do the
>> right thing, then it is broken -- it doesn't handle full Unicode.
>
> Correct. I'm assuming it's been written to be consistent with Java's indexing.
>
>> If
>> that insert method is coded correctly, then it does not do naive
>> indexing or slicing operations ... it traverses the string one
>> character at a time and inserts the space after the 4th character --
>> *not* the 4th 16-bit "char" -- please read the java.lang.Character API
>> that I posted previously. Keep in mind that java.lang.String itself is
>> not broken, it's a valid Unicode representation (UTF-16), so
>> converting another valid Unicode representation, whether js is UTF-8
>> or UTF-16 or UTF-32, to java.lang.String is not a problem.
>
> The point was that there's a lot of code out there which is broken. But it's broken in a
> fairly consistent way.
>
>> Well, no, it would not have "worked fine", because being broken for
>> Unicode supplemental characters is not "fine".
>
> Sorry, I meant that the example alone would have worked fine. My point was that a lot of
> the brokenness cancels itself out, which is jolly nice and convenient most of the time.

I appreciate the care with which you've laid out your argument here,
and I think I've addressed your points above in other messages. I
think we're in agreement on the facts so far.

>> The current default is *broken* and results in all Scala code that
>> operates on Strings as a collection of characters being broken. At the
>> very least we need a transition path from that broken default to
>> something that isn't broken.
>
> I could be perfectly satisfied that my code weren't broken if I'm only accessing or reading
> data in from encodings which I know to be subsets of the BMP. Often we can know this.

I disagree that we can know this, unless we are reading data known to
be in ASCII, or we are reading data that we know doesn't contain
supplemental characters only because they haven't been supported in
the past. e.g., we know that compilable Scala programs do not contain
these characters because, if they did, they wouldn't compile.

>
>> Yes, indeed, but that's something that some of the discussants here
>> clearly lack.
>
> Myself included. But the reason it's so rarely a problem is that the characters which cause
> problems are so rarely used. For a glimpse of what Java won't handle correctly,

As I've noted, Java has the capabilities to handle them correctly. It
is only individual pieces of Java code that don't handle them
correctly. And all Scala code, unless there's Scala code that calls
the java.lang.Character methods that deal with codePoints, which I
doubt. AFAIK, the Java core libraries work correctly, but the Scala
libraries do not.

> have a look
> at what's in the supplementary planes:
>
> http://en.wikipedia.org/wiki/Plane_(Unicode)
>
> To summarize flippantly, it's stuff people don't use.

Well, I'm sorry, but you're wrong. People are filing bug reports,
e.g., http://osdir.com/ml/text.xml.xerces-j.devel/2005-04/msg00074.html
and http://bugs.mysql.com/bug.php?id=54175 so they are using it. Sun
went to a lot of trouble to support it -- see
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ --
I don't think they would have if people weren't using it. There are
people whose names are spelled with supplemental characters.
Overgeneralizing from ourselves to "people" just won't do.

> I don't want to labour the point,
> because I understand there is a problem. I'm just acknowledging that most people, most
> programmers, won't ever experience it.

Certainly if the use of supplemental characters is discouraged because
of broken tools that force broken software. But, if the tools are
fixed, then most programmers will have to become aware of the issues,
and even include supplemental characters in test cases. If
supplemental characters were universally handled properly, then we
might see a considerable increase in their use, and even their
definition in Unicode. So I don't think we can be at all certain about
your prediction.

> How often do you see the telltale signs of UTF-8
> being interpreted as Latin1 or ASCII?

Quite frequently ... with a fairly steady decrease as tools have been fixed.

> This is a million times less common.

I don't think we can make that estimate at this time.

> BUT, having said that, an extremely strong motivator for fixing broken code is that it poses
> potential security hazards. The sad thing is that mixing broken and unbroken code creates
> more risk in this respect during the transition.

And if Java code continues to be fixed while Scala continues not to
be, Scala will more and more be the source of such risk.

Wed, 2010-12-22, 03:07

#13

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Tue, Dec 21, 2010 at 4:55 PM, Jon Pretty <_@scala.propensive.com> wrote:

> http://en.wikipedia.org/wiki/Plane_(Unicode)
>
> To summarize flippantly, it's stuff people don't use.

I looked at that page, saw the dismissive term "astral plane", and
stopped there, as I suspect you did. But then I went back and looked
at the citation for the term, which tells a quite different story:

http://www.tlg.uci.edu/~opoudjis/unicode/unicode_astral.html

"""
programmers have to assume a character can have a million possible
values, not just 64K, which means they often have to change their
existing code. Furthermore, they are not drastically common in use:
most 'real' scripts (though not all) are ensconced in the BMP. So
software support for the supplementary planes lags that of the BMP:
virtually no fonts contain them (Code2001 remains the honourable
exception, with the recent additions of Alphabetum, and for the
Unicode 4.1 Greek additions Cardo and New Athena Unicode); old
operating systems don't acknowledge them; some browsers still can't
deal with them; some text editors don't accept them; and so on. As of
this writing for instance, Dreamweaver MX for MacOSX (which I am
currently using to prepare this) will let you paste BMP text into its
WYSIWYG window; but pasting Supplementary Plane text there will make
it crash.
"""

i.e., there's a lot of buggy code out there.

"""
The informal name for the supplementary planes of Unicode is "astral
planes", since (especially in the late '90s) their use seemed to be as
remote as the theosophical "great beyond". There has been objection to
this jocular usage (see "string vs. char" and subsequent discussion on
Unicode list); and as Planes 1 and 2 spread in use there will be less
occasion to feel that the planes really are 'astral'. But the jocular
reference is harmless, and it serves as a reminder that we're not
quite there yet.
"""

That's nearly the opposite message as "it's stuff people don't use",
and your use of that phrase is a counterexample to the claim that the
term is harmless.

"""
Different planes are designated for different functions, as detailed
in the Unicode Roadmap:

* The Supplementary Multilingual Plane (SMP: Plane 1, U+010000 -
U+01FFFF), according to the Standard, is "dedicated to the encoding of
lesser-used historic scripts, special-purpose invented scripts, and
special notational systems, which either could not be fit into the BMP
or which would be of very infrequent usage." The scripts and systems
associated with Greek reside here.
* The Supplementary Ideographic Plane (SIP: Plane 2, U+020000 -
U+02FFFF) contains extra space for CJK (Chinese–Japanese–Korean)
characters, including Cantonese-specific characters and obsolete
characters.
* Since it looks like that will not be enough space for CJK, the
Auxiliary Ideographic Plane (AIP: Plane 3, U+030000 - U+03FFFF) has
been proposed as additional space (see Unicode list discussion).
* The Supplementary Special-Purpose Plane (SSP: Plane 14, U+0D0000
- U+0DFFFF) is designated for format control characters; this
currently includes glyph variation selectors, and language tags.
* Finally, Planes 15 and 16 (U+0E0000 - U+0FFFFF) have been
allocated for Private Use, just as U+E000 - U+F8FF have been in the
BMP.
"""

While *some* of that is stuff that not many people would use, some of
it, such as Cantonese and other CJK characters, and format control
characters, is not.

"""
In Mathematics, then, shifts of typeface, script, and style are
important enough to yield completely distinct meaning: the difference
between a script and a blackletter H is rather more grave in
Mathematics than it is in normal textual use of the Latin script. In
textual use, Helmut in Fraktur, italics, and cursive style are
identical in meaning. This is why the distinction is extraneous to the
notion of plain text: so the Latin script in Unicode does not have a
distinct codepoint for Fraktur H, Italic H, and Cursive H.

Mathematics does. In fact, there's a whole block of them at U+1D400 -
U+1D7FF: Mathematical Alphanumeric Symbols.
"""

It would be sad if the day comes when most applications, save those
written in Scala, properly handle certain mathematical texts. Sooner
or later, proper handling of full Unicode will be required, and the
"stuff people don't use" argument is a rationalization that doesn't
fly.

Wed, 2010-12-22, 17:47

#14

dcsobral

Joined: 2009-04-23,

Re: Support for Ropes in Scala]

Woah! Wait a second!

There isn't a "Scala's current indexOf method on String", because String is not a Scala class, and, naturally, neither is its indexOf method. It's JAVA. If you say Java got fixed, then it got fixed for both Java and Scala.

Both Char and String are _Java_.

So I don't understand how can you say that Java is getting fixed but Scala is stuck. Would you mind providing an example?

On Tue, Dec 21, 2010 at 22:25, Jim Balter <Jim@balter.name> wrote:

P.S. A moral that can be drawn from Jon's example is that, as more and
more Java code is converted to handle Strings properly by using the
codePoint methods that were added to java.lang.Character in 1.5,
interoperability between Scala and Java will break down. Given a
*properly* coded Java insert method and Scala's current indexOf method
on Strings, or v.v., you will get the wrong result for Strings that
contain supplemental characters. Strings are not arrays of fixed-size
chars, they are sequences of variable-size characters, and commercial
Java programmers now know that and have the tools to program them
properly -- but laboriously. But the only tools Scala programmers have
are those awful java.lang.Character methods; natural Scala approaches
give the wrong results.


--
Daniel C. Sobral

I travel to the future all the time.

Thu, 2010-12-23, 00:27

#15

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

Again, read the java.lang.Character API, with its many codePoint
methods. As I said, you can code Java string handling correctly *if*
you assiduously avoid the char type and slicing or indexing strings.
And again, as I said, you can code it correctly in Scala if you use
those java.lang.Character methods -- but that's even more horrible to
do in Scala than in Java, where such imperative code is standard. In
Scala, you should be able to traverse a Java string as a sequence of
32-bit Unicode characters, even though it is *encoded* as an array of
16-bit UTF-16 code units. And you should be able to naturally
(functionally) traverse UTF-8 and UTF-32 encodings as well. And
character constants should be 32-bit entities, not 16-bit entities
(that's broken in Java too, but Java will never change whereas Scala
still can).
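
As a rough illustration of that kind of traversal, the sketch below walks a java.lang.String one code point at a time. The object and method names (CodePoints, codePoints) are assumptions made for this example, not an existing Scala library API; the only real calls used are String.codePointAt and Character.charCount.

```scala
object CodePoints {
  /** Lazily yields the Unicode code points of a UTF-16 encoded String. */
  def codePoints(s: String): Iterator[Int] = new Iterator[Int] {
    private var i = 0 // current position, in 16-bit code units

    def hasNext: Boolean = i < s.length

    def next(): Int = {
      val cp = s.codePointAt(i)    // combines a surrogate pair when one starts at i
      i += Character.charCount(cp) // advance by 1 or 2 code units
      cp
    }
  }
}
```

Because the result is an ordinary Iterator[Int], the usual map, filter and fold operations then work per character rather than per 16-bit unit.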

On Wed, Dec 22, 2010 at 8:40 AM, Daniel Sobral wrote:
> Woah! Wait a second!
>
> There isn't a "Scala's current indexOf method on String", because String is
> not a Scala class, and, naturally, neither is its indexOf method. It's JAVA.
> If you say Java got fixed, then it got fixed for both Java and Scala.
>
> Both Char and String are _Java_.
>
> So I don't understand how you can say that Java is getting fixed but Scala
> is stuck. Would you mind providing an example?
>
> On Tue, Dec 21, 2010 at 22:25, Jim Balter wrote:
>>
>> P.S. A moral that can be drawn from Jon's example is that, as more and
>> more Java code is converted to handle Strings properly by using the
>> codePoint methods that were added to java.lang.Character in 1.5,
>> interoperability between Scala and Java will break down. Given a
>> *properly* coded Java insert method and Scala's current indexOf method
>> on Strings, or v.v., you will get the wrong result for Strings that
>> contain supplemental characters. Strings are not arrays of fixed-size
>> chars, they are sequences of variable-size characters, and commercial
>> Java programmers now know that and have the tools to program them
>> properly -- but laboriously. But the only tools Scala programmers have
>> are those awful java.lang.Character methods; natural Scala approaches
>> give the wrong results.

Thu, 2010-12-23, 14:07

#16

dcsobral

Joined: 2009-04-23,

Re: Support for Ropes in Scala]

Scala _cannot_ change that, for two reasons:
1. A Char is an AnyVal, and any replacement would have to be an AnyRef. This has consequences all the way down to the turtles.
2. Char and String are supposed to be the same in Scala as in Java, precisely so there's no "translation layer" between Java and Scala.
If Odersky had not opted to make Scala interoperate with Java so transparently, then it might have been another matter. As it is, if you get a Char interface from Java, then you pass a Char to it, and vice versa.
So, Char will always be Java's char and String will always be Java's String. Any improvement must be made without changing this. One could have a UTF32String and UTF32Char, for example, with all the methods that come with them, plus implicit conversions to and from String, and from Char.
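
A minimal sketch of what such a wrapper might look like follows. It is only an assumption about the shape of the idea: the class name UTF32String comes from the paragraph above, but every member shown here is hypothetical, and code points are represented as plain Int values rather than a separate UTF32Char type.

```scala
object Utf32Sketch {
  import scala.language.implicitConversions

  /** Hypothetical wrapper holding one Int per Unicode code point. */
  final class UTF32String private (private val cps: Array[Int]) {
    def length: Int = cps.length   // number of code points, not chars
    def apply(i: Int): Int = cps(i) // code point at position i
    def indexOf(sub: UTF32String): Int =
      cps.toSeq.indexOfSlice(sub.cps.toSeq)
    // Re-encode as UTF-16 using the String(int[], int, int) constructor.
    override def toString: String = new String(cps, 0, cps.length)
  }

  object UTF32String {
    def apply(s: String): UTF32String =
      new UTF32String(s.codePoints().toArray())

    // Implicit conversions in both directions, as suggested above.
    implicit def fromString(s: String): UTF32String = apply(s)
    implicit def toJavaString(u: UTF32String): String = u.toString
  }
}
```

Note that the conversions only translate the string itself. An index computed on the wrapper is still a code point index, so it must not be handed unchanged to Java code that expects a char index.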


--
Daniel C. Sobral

I travel to the future all the time.

Thu, 2010-12-23, 18:57

#17

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral wrote:
> Scala _cannot_ change that, for two reasons:
> 1. A Char is an AnyVal, and any replacement would have to be an AnyRef. This
> has consequences all the way down to the turtles.
> 2. Char and String are supposed to be the same in Scala as in Java,
> precisely so there's no "translation layer" between Java and Scala.
> If Odersky had not opted to make Scala interoperate with Java so
> transparently, then it might have been another matter. As it is, if you get
> a Char interface from Java, then you pass a Char to it, and vice versa.
> So, Char will always be a Java's char and String will always be a Java's
> String. Any improvement must be done without changing this. One could have
> an UTF32String and UTF32Char, for example, and all methods that come with
> it, plus implicit conversions to and from String, and from Char.

I disagree that it can't be changed, and will be working on a design
and a proposal. And I think that, eventually, it must change, and so
the sooner the better. No longer does Scala interoperate with Java
because the Java people woke up and realized how broken it is. All
code that deals with Java chars is broken; eventually all use of it
will be abandoned, except by amateurs who don't know what they're
doing, and others who don't care whether their code is broken. The
new Java "char", as can be seen from the java.lang.Character methods,
is called "int".
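
A small sketch of that last point, using only real java.lang.Character methods (toChars and the Int and Char overloads of isLetter); the object and value names are made up for the example:

```scala
object IntAsChar {
  val cp: Int = 0x1D538 // a supplementary code point, a letter outside the BMP

  // In UTF-16 this code point needs two 16-bit code units (a surrogate pair).
  val units: Array[Char] = Character.toChars(cp)

  // The Int overload sees the whole character.
  val letterAsInt: Boolean = Character.isLetter(cp)

  // The Char overload sees only the high surrogate, which is not a letter.
  val letterAsChar: Boolean = Character.isLetter(units(0))
}
```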


Thu, 2010-12-23, 19:07

#18

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

P.S.

Nothing you wrote below contradicts what I wrote that you are
responding to, which said nothing about changing what types Char and
String are (although I do think that the default should change) -- I
don't know what your "that" (in "_cannot_ change that") refers to, but
it's not to anything in what I wrote. What I wrote was that the type
of character constants should change (to UTF32Char or the equivalent),
and that you should be able to traverse a Java String (which is
properly seen as a UTF-16 encoding, *not* an array of chars) as a
sequence of 32-bit Unicode characters (UTF32Char or the equivalent).
Such a change would be entirely within the Scala library. As it is,
StringOps provides the wrong abstraction.
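
A short sketch of the kind of behaviour meant here, assuming a string that contains one supplementary character. The object and value names are illustrative only; length and take come from the Scala collection view of a String, while codePointCount and offsetByCodePoints are java.lang.String methods.

```scala
object StringOpsPitfalls {
  val s = "x\uD835\uDD38" // 'x' followed by U+1D538 as a surrogate pair

  // The collection view counts 16-bit code units, not characters.
  val charCount: Int = s.length
  val codePointCount: Int = s.codePointCount(0, s.length)

  // Taking "two characters" cuts the surrogate pair in half.
  val truncated: String = s.take(2)
  val endsInLoneSurrogate: Boolean = Character.isHighSurrogate(truncated.last)

  // Converting the code point count to a char offset first keeps the pair whole.
  val safeTake: String = s.substring(0, s.offsetByCodePoints(0, 2))
}
```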


Thu, 2010-12-23, 21:17

#19

Tony Morris 2

Joined: 2009-03-20,

Re: Support for Ropes in Scala]

On 24/12/10 03:51, Jim Balter wrote:
> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral
> wrote:
>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>> AnyVal, and any replacement would have to be an AnyRef. This has
>> consequences all the way down to the turtles. 2. Char and String
>> are supposed to be the same in Scala as in Java, precisely so
>> there's no "translation layer" between Java and Scala. If Odersky
>> had not opted to make Scala interoperate with Java so
>> transparently, then it might have been another matter. As it is,
>> if you get a Char interface from Java, then you pass a Char to
>> it, and vice versa. So, Char will always be a Java's char and
>> String will always be a Java's String. Any improvement must be
>> done without changing this. One could have an UTF32String and
>> UTF32Char, for example, and all methods that come with it, plus
>> implicit conversions to and from String, and from Char.
>
> I disagree that it can't be changed, and will be working on a
> design and a proposal. And I think that, eventually, it must
> change, and so the sooner the better. No longer does Scala
> interoperate with Java because the Java people woke up and realized
> how broken it is. All code that deals with Java chars is broken;
> eventually all use of it will be abandoned, except by amateurs who
> don't know what they're doing, and others who don't care whether
> their code is broken . The new Java "char", as can be seen from the
> java.lang.Character methods, is called "int".
>
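There are plenty of languages that do not have the burden of
interoperability with Java. Most of my day job is using one of these
-- it's called Haskell.

If you're going to "fix" this bit of Scala, at the expense of
interoperability, why not fix a lot of things now that you have thrown
away the shackles?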

Thu, 2010-12-23, 21:27

#20

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 12:11 PM, Tony Morris wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 24/12/10 03:51, Jim Balter wrote:
>> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral
>> wrote:
>>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>>> AnyVal, and any replacement would have to be an AnyRef. This has
>>> consequences all the way down to the turtles. 2. Char and String
>>> are supposed to be the same in Scala as in Java, precisely so
>>> there's no "translation layer" between Java and Scala. If Odersky
>>> had not opted to make Scala interoperate with Java so
>>> transparently, then it might have been another matter. As it is,
>>> if you get a Char interface from Java, then you pass a Char to
>>> it, and vice versa. So, Char will always be a Java's char and
>>> String will always be a Java's String. Any improvement must be
>>> done without changing this. One could have an UTF32String and
>>> UTF32Char, for example, and all methods that come with it, plus
>>> implicit conversions to and from String, and from Char.
>>
>> I disagree that it can't be changed, and will be working on a
>> design and a proposal. And I think that, eventually, it must
>> change, and so the sooner the better. No longer does Scala
>> interoperate with Java because the Java people woke up and realized
>> how broken it is. All code that deals with Java chars is broken;
>> eventually all use of it will be abandoned, except by amateurs who
>> don't know what they're doing, and others who don't care whether
>> their code is broken. The new Java "char", as can be seen from the
>> java.lang.Character methods, is called "int".
>>
> There are plenty of languages that do not have the burden of
> interoperability with Java. Most of my day job is using one of these
> - -- it's called haskell.
>
> If you're going to "fix" this bit of scala, at the expense of
> interoperability, why not fix a lot of things now that you have thrown
> away the shackles?

Why not take the time to actually understand what has been said?

Thu, 2010-12-23, 22:17

#21

Erik Engbrecht

Joined: 2008-12-19,

Re: Support for Ropes in Scala]

I think Tony's comment makes perfect sense in the context of this thread.
On Thu, Dec 23, 2010 at 3:15 PM, Jim Balter <Jim@balter.name> wrote:

> >
> > If you're going to "fix" this bit of scala, at the expense of
> > interoperability, why not fix a lot of things now that you have thrown
> > away the shackles?
>
> Why not take the time to actually understand what has been said?

Thu, 2010-12-23, 22:27

#22

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

Like his snipe, yours is not helpful. I would ask you the same
question I asked him. String handling is broken, and that won't go
away by putting "fix" in scare quotes. And a fix would not be "at the
expense of interoperability" -- that shows a complete misunderstanding
of the issue and of what I have written. I am not proposing "throwing
away shackles" -- again a complete misunderstanding.

On Thu, Dec 23, 2010 at 1:07 PM, Erik Engbrecht
wrote:
> I think Tony's comment makes perfect sense in the context of this thread.
> On Thu, Dec 23, 2010 at 3:15 PM, Jim Balter wrote:
>>
>> >
>> > If you're going to "fix" this bit of scala, at the expense of
>> > interoperability, why not fix a lot of things now that you have thrown
>> > away the shackles?
>>
>> Why not take the time to actually understand what has been said?
>
>

Thu, 2010-12-23, 23:07

#23

Kevin Wright 2

Joined: 2010-05-30,

Re: Support for Ropes in Scala]

I'm truly sorry, but if I didn't do this then someone else would.

http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html

There's actually a serious message here. However efficient haskell may be, is it really good enough to allow us to rewrite literally centuries of cumulative work, overnight, that's already in production? If not, then we'll continue to need to work with existing systems, and this kind of interop is essential.

By all means, we should evolve the languages and tools that we work with, but to just abandon what has gone before is the worst kind of disrespect...

On 23 Dec 2010 20:11, "Tony Morris" <tonymorris@gmail.com> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 24/12/10 03:51, Jim Balter wrote:
>> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral <dcsobral@gmail.com>
>> wrote:
>>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>>> AnyVal, and any replacement would have to be an AnyRef. This has
>>> consequences all the way down to the turtles. 2. Char and String
>>> are supposed to be the same in Scala as in Java, precisely so
>>> there's no "translation layer" between Java and Scala. If Odersky
>>> had not opted to make Scala interoperate with Java so
>>> transparently, then it might have been another matter. As it is,
>>> if you get a Char interface from Java, then you pass a Char to
>>> it, and vice versa. So, Char will always be a Java's char and
>>> String will always be a Java's String. Any improvement must be
>>> done without changing this. One could have an UTF32String and
>>> UTF32Char, for example, and all methods that come with it, plus
>>> implicit conversions to and from String, and from Char.
>>
>> I disagree that it can't be changed, and will be working on a
>> design and a proposal. And I think that, eventually, it must
>> change, and so the sooner the better. No longer does Scala
>> interoperate with Java because the Java people woke up and realized
>> how broken it is. All code that deals with Java chars is broken;
>> eventually all use of it will be abandoned, except by amateurs who
>> don't know what they're doing, and others who don't care whether
>> their code is broken. The new Java "char", as can be seen from the
>> java.lang.Character methods, is called "int".
>>
> There are plenty of languages that do not have the burden of
> interoperability with Java. Most of my day job is using one of these
> - -- it's called haskell.
>
> If you're going to "fix" this bit of scala, at the expense of
> interoperability, why not fix a lot of things now that you have thrown
> away the shackles?
>
> - --
> Tony Morris
> http://tmorris.net/
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk0TrQYACgkQmnpgrYe6r60yhwCgylilakdvL2YDVejGYb032ZET
> gc4AnRsNJMb9HTSdB7cUsl5z8EoXoCL3
> =sTt4
> -----END PGP SIGNATURE-----
>

Thu, 2010-12-23, 23:17

#24

Randall R Schulz

Joined: 2008-12-16,

Re: Support for Ropes in Scala]

On Thursday December 23 2010, Jim Balter wrote:
> Like his snipe, yours is not helpful. I would ask you the same
> question I asked him. String handling is broken, and that won't go
> away by putting "fix" in scare quotes. And a fix would not be "at the
> expense of interoperability" -- that shows a complete
> misunderstanding of the issue and of what I have written. I am not
> proposing "throwing away shackles" -- again a complete
> misunderstanding.

Isn't it time for you to set about crafting the fix you propose?
Is there more to be profitably debated here?

Randall Schulz

Thu, 2010-12-23, 23:27

#25

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

I already said that I will be working on a proposal ... in fact I have
started on it, so your comment makes unwarranted assumptions. But
certainly there's nothing profitable from the sort of "debate" that
Tony and Erik seem to want to have, nor from your metacomment. If you
don't want to discuss the technical issues, fine, don't read further
and don't comment further. But I have found the discussion with people
like Jon Pretty and Daniel Sobral, who did address technical issues,
to be helpful.

On Thu, Dec 23, 2010 at 2:10 PM, Randall R Schulz wrote:
> On Thursday December 23 2010, Jim Balter wrote:
>> Like his snipe, yours is not helpful. I would ask you the same
>> question I asked him. String handling is broken, and that won't go
>> away by putting "fix" in scare quotes. And a fix would not be "at the
>> expense of interoperability" -- that shows a complete
>> misunderstanding of the issue and of what I have written. I am not
>> proposing "throwing away shackles" -- again a complete
>> misunderstanding.
>
> Isn't it time for you to set about crafting the fix you propose?
> Is there more to be profitably debated here?
>
>
> Randall Schulz
>

Thu, 2010-12-23, 23:47

#26

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

Who is this comment aimed at? Who is talking about abandoning what has
gone before? I'm certainly not. But when the Unicode people extended
the number of characters so that they no longer all fit in a 16-bit
word, they broke very large amounts of code (particularly Java code),
and that has consequences. It's very unfortunate that the design of
Scala ignored this major issue (which later was addressed, however
poorly, in Java 1.5), so that, as a result, Scala code is broken as
well, but that has consequences too.
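
A minimal sketch of that breakage, assuming only the standard java.lang.String and java.lang.Character methods available since Java 1.5. The object name SurrogateDemo and the choice of U+1D11E are illustrative, not from the thread:

```scala
object SurrogateDemo {
  // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane,
  // so UTF-16 stores it as the surrogate pair D834 DD1E.
  val clef: String = new String(Character.toChars(0x1D11E))
  val text: String = "a" + clef + "b"

  // length counts 16-bit UTF-16 code units, not characters.
  val charLength: Int = text.length
  // codePointCount counts Unicode code points.
  val codePointLength: Int = text.codePointCount(0, text.length)

  // charAt(1) returns only the high half of the pair, which is not a character.
  val halfAtIndex1: Boolean = Character.isHighSurrogate(text.charAt(1))
  // codePointAt(1) reads the whole pair and returns an Int.
  val wholeAtIndex1: Int = text.codePointAt(1)
}
```

Here charLength is 4 while codePointLength is 3, which is the kind of mismatch that code written against 16-bit chars silently gets wrong.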

Scala has excellent facilities for evolutionary interop, and they
should of course be employed in addressing the problem. Talk about
"disrespect" does not address anything useful.

On Thu, Dec 23, 2010 at 1:58 PM, Kevin Wright wrote:
> I'm truly sorry, but if I didn't do this then someone else would.
>
> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-dis...
>
> There's actually a serious message here. However efficient haskell may be,
> is it really good enough to allow us to rewrite literally centuries of
> cumulative work, overnight, that's already in production? If not, then we'll
> continue to need to work with existing systems, and this kind of interop is
> essential.
>
> By all means, we should evolve the languages and tools that we work with,
> but to just abandon what has gone before is the worst kind of disrespect...
>
> On 23 Dec 2010 20:11, "Tony Morris" wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 24/12/10 03:51, Jim Balter wrote:
>>> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral
>>> wrote:
>>>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>>>> AnyVal, and any replacement would have to be an AnyRef. This has
>>>> consequences all the way down to the turtles. 2. Char and String
>>>> are supposed to be the same in Scala as in Java, precisely so
>>>> there's no "translation layer" between Java and Scala. If Odersky
>>>> had not opted to make Scala interoperate with Java so
>>>> transparently, then it might have been another matter. As it is,
>>>> if you get a Char interface from Java, then you pass a Char to
>>>> it, and vice versa. So, Char will always be a Java's char and
>>>> String will always be a Java's String. Any improvement must be
>>>> done without changing this. One could have an UTF32String and
>>>> UTF32Char, for example, and all methods that come with it, plus
>>>> implicit conversions to and from String, and from Char.
>>>
>>> I disagree that it can't be changed, and will be working on a
>>> design and a proposal. And I think that, eventually, it must
>>> change, and so the sooner the better. No longer does Scala
>>> interoperate with Java because the Java people woke up and realized
>>> how broken it is. All code that deals with Java chars is broken;
>>> eventually all use of it will be abandoned, except by amateurs who
>>> don't know what they're doing, and others who don't care whether
>>> their code is broken. The new Java "char", as can be seen from the
>>> java.lang.Character methods, is called "int".
>>>
>> There are plenty of languages that do not have the burden of
>> interoperability with Java. Most of my day job is using one of these
>> - -- it's called haskell.
>>
>> If you're going to "fix" this bit of scala, at the expense of
>> interoperability, why not fix a lot of things now that you have thrown
>> away the shackles?
>>
>> - --
>> Tony Morris
>> http://tmorris.net/
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.10 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>
>> iEYEARECAAYFAk0TrQYACgkQmnpgrYe6r60yhwCgylilakdvL2YDVejGYb032ZET
>> gc4AnRsNJMb9HTSdB7cUsl5z8EoXoCL3
>> =sTt4
>> -----END PGP SIGNATURE-----
>>
>

Fri, 2010-12-24, 00:27

#27

Tony Morris 2

Joined: 2009-03-20,

Re: Support for Ropes in Scala]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 06:15, Jim Balter wrote:
> On Thu, Dec 23, 2010 at 12:11 PM, Tony Morris
> wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>
>> On 24/12/10 03:51, Jim Balter wrote:
>>> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral
>>> wrote:
>>>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>>>> AnyVal, and any replacement would have to be an AnyRef. This
>>>> has consequences all the way down to the turtles. 2. Char and
>>>> String are supposed to be the same in Scala as in Java,
>>>> precisely so there's no "translation layer" between Java and
>>>> Scala. If Odersky had not opted to make Scala interoperate
>>>> with Java so transparently, then it might have been another
>>>> matter. As it is, if you get a Char interface from Java, then
>>>> you pass a Char to it, and vice versa. So, Char will always
>>>> be a Java's char and String will always be a Java's String.
>>>> Any improvement must be done without changing this. One could
>>>> have an UTF32String and UTF32Char, for example, and all
>>>> methods that come with it, plus implicit conversions to and
>>>> from String, and from Char.
>>>
>>> I disagree that it can't be changed, and will be working on a
>>> design and a proposal. And I think that, eventually, it must
>>> change, and so the sooner the better. No longer does Scala
>>> interoperate with Java because the Java people woke up and
>>> realized how broken it is. All code that deals with Java chars
>>> is broken; eventually all use of it will be abandoned, except
>>> by amateurs who don't know what they're doing, and others who
>>> don't care whether their code is broken. The new Java "char",
>>> as can be seen from the java.lang.Character methods, is called
>>> "int".
>>>
>> There are plenty of languages that do not have the burden of
>> interoperability with Java. Most of my day job is using one of
>> these - -- it's called haskell.
>>
>> If you're going to "fix" this bit of scala, at the expense of
>> interoperability, why not fix a lot of things now that you have
>> thrown away the shackles?
>
> Why not take the time to actually understand what has been said?
Good point, try it.

Fri, 2010-12-24, 00:27

#28

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 3:19 PM, Tony Morris wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 24/12/10 06:15, Jim Balter wrote:
>> On Thu, Dec 23, 2010 at 12:11 PM, Tony Morris
>> wrote:
>>>
>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>>
>>> On 24/12/10 03:51, Jim Balter wrote:
>>>> On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral
>>>> wrote:
>>>>> Scala _cannot_ change that, for two reasons: 1. A Char is an
>>>>> AnyVal, and any replacement would have to be an AnyRef. This
>>>>> has consequences all the way down to the turtles. 2. Char and
>>>>> String are supposed to be the same in Scala as in Java,
>>>>> precisely so there's no "translation layer" between Java and
>>>>> Scala. If Odersky had not opted to make Scala interoperate
>>>>> with Java so transparently, then it might have been another
>>>>> matter. As it is, if you get a Char interface from Java, then
>>>>> you pass a Char to it, and vice versa. So, Char will always
>>>>> be a Java's char and String will always be a Java's String.
>>>>> Any improvement must be done without changing this. One could
>>>>> have an UTF32String and UTF32Char, for example, and all
>>>>> methods that come with it, plus implicit conversions to and
>>>>> from String, and from Char.
>>>>
>>>> I disagree that it can't be changed, and will be working on a
>>>> design and a proposal. And I think that, eventually, it must
>>>> change, and so the sooner the better. No longer does Scala
>>>> interoperate with Java because the Java people woke up and
>>>> realized how broken it is. All code that deals with Java chars
>>>> is broken; eventually all use of it will be abandoned, except
>>>> by amateurs who don't know what they're doing, and others who
>>>> don't care whether their code is broken. The new Java "char",
>>>> as can be seen from the java.lang.Character methods, is called
>>>> "int".
>>>>
>>> There are plenty of languages that do not have the burden of
>>> interoperability with Java. Most of my day job is using one of
>>> these - -- it's called haskell.
>>>
>>> If you're going to "fix" this bit of scala, at the expense of
>>> interoperability, why not fix a lot of things now that you have
>>> thrown away the shackles?
>>
>> Why not take the time to actually understand what has been said?
> Good point, try it.

I'm glad to know that you understand what I've said better than I do.

Fri, 2010-12-24, 00:47

#29

Tony Morris 2

Joined: 2009-03-20,

Re: Support for Ropes in Scala]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 09:26, Jim Balter wrote:
>
> I'm glad to know that you understand what I've said better than I
> do.

What's with the snarky crap?

Fri, 2010-12-24, 00:47

#30

Tony Morris 2

Joined: 2009-03-20,

[Serious messages]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There is no serious message here...

...except for perhaps an examination on why Yegge insists on offering
comment on topics of which he has not even the most fundamental
understanding. Let's be clear; he has not even the most basic
understanding of many of the relevant topics. I could list them and
that man would be completely baffled. In fact, I have done exactly
that before.

It is demonstrable by historical commentary, which has remained
immutable to this day. Further, Yegge gives this all away accidentally
and unknowingly. Please do not take this man seriously. *Please*, for
your sake.

I use Haskell almost all day. I work for a Java product company. There
is a reason I use Haskell all day and it has something to do with
productivity. Recently, myself and a colleague had to write a program
for our Java programmers, so I wrote it in Scala and told them it was
Java bytecode with a really great editor that has a special save
function called scalac. I use many languages when this "Java"
requirement doesn't come up, including Haskell. Think about this. Please.

I didn't intend to hijack this thread with this topic and I really do
not like attacking people, but if you're going to spread a "serious
message", then please do not refer to such catastrophic departure from
reality on behalf of that clueless man. He is not to be taken
seriously, so I will accept your introductory apology, even if out of
context.

I'm more interested in an actual serious message than any foreseeable
debate on the contentious issues that I may have raised. I hope the
risk works out...

On 24/12/10 07:58, Kevin Wright wrote:
>
> I'm truly sorry, but if I didn't do this then someone else would.
>
> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-dis...
>
> There's actually a serious message here. However efficient haskell
> may be, is it really good enough to allow us to rewrite literally
> centuries of cumulative work, overnight, that's already in
> production? If not, then we'll continue to need to work with
> existing systems, and this kind of interop is essential.
>
> By all means, we should evolve the languages and tools that we work
> with, but to just abandon what has gone before is the worst kind of
> disrespect...
>

Fri, 2010-12-24, 01:17

#31

ichoran

Joined: 2009-08-14,

Re: Support for Ropes in Scala]

On Wed, Dec 22, 2010 at 6:18 PM, Jim Balter <Jim@balter.name> wrote:

> Again, read the java.lang.Character API, with its many codePoint
> methods. As I said, you can code Java string handling correctly *if*
> you assiduously avoid the char type and slicing or indexing strings.

So we know what to do.

> And again, as I said, you can code it correctly in Scala if you use
> those java.lang.Character methods -- but that's even more horrible to
> do in Scala than in Java, where such imperative code is standard.

This sounds like a case for a library.

> In
> Scala, you should be able to traverse a Java string as a sequence of
> 32-bit Unicode characters, even though it is *encoded* as an array of
> 16-bit UTF-16 code units. And you should be able to naturally
> (functionally) traverse UTF-8 and UTF-32 encodings as well.

This really sounds like a case for a library. If you care enough, you should write one. If you don't care enough, let people who do care enough make such arguments instead of you. They, after having tried to write one, will probably understand the issues better, and will make better recommendations.
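
As a rough idea of what the core of such a library could look like, here is a minimal sketch. The object name CodePoints and its two methods are hypothetical; only codePointAt, Character.charCount and the String(bytes, charset) constructor are real JDK calls:

```scala
object CodePoints {
  /** Iterates over the Unicode code points of a UTF-16 encoded String. */
  def iterator(s: String): Iterator[Int] = new Iterator[Int] {
    private var i = 0 // index in UTF-16 code units
    def hasNext: Boolean = i < s.length
    def next(): Int = {
      if (!hasNext) throw new NoSuchElementException("end of string")
      val cp = s.codePointAt(i)      // reads one or two code units
      i += Character.charCount(cp)   // advance by 1 or 2 accordingly
      cp
    }
  }

  /** Decodes UTF-8 bytes first, then iterates over the resulting code points. */
  def fromUtf8(bytes: Array[Byte]): Iterator[Int] =
    iterator(new String(bytes, java.nio.charset.StandardCharsets.UTF_8))
}
```

With this, CodePoints.iterator(s).map(...) or .filter(...) traverses characters functionally without ever touching Char, while s itself remains an ordinary java.lang.String.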

> And
> character constants should be 32-bit entities, not 16-bit entities
> (that's broken in Java too, but Java will never change whereas Scala
> still can).

If Scala changes, Java-Scala interop breaks. But you can always write a Unicode-aware layer on top of Java chars and Strings, and use that instead. If you're careful, you can make it look to Java-land like whatever the consensus solution is in Java-land, and make it extra-pretty in Scala. Implicit conversions are your friend.
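
One possible shape for such a layer, as a sketch only. UnicodeSyntax, UnicodeString, codePointLength and codePointSeq are made-up names for illustration; the value handed to Java is always the unchanged java.lang.String:

```scala
import scala.language.implicitConversions

object UnicodeSyntax {
  /** Hypothetical wrapper; the underlying value stays a plain java.lang.String. */
  final class UnicodeString(val underlying: String) {
    /** Number of Unicode code points, as opposed to UTF-16 code units. */
    def codePointLength: Int = underlying.codePointCount(0, underlying.length)

    /** All code points of the string, in order. */
    def codePointSeq: Vector[Int] = {
      val b = Vector.newBuilder[Int]
      var i = 0
      while (i < underlying.length) {
        val cp = underlying.codePointAt(i)
        b += cp
        i += Character.charCount(cp)
      }
      b.result()
    }
  }

  // Conversions in both directions, so Java APIs still receive a String.
  implicit def toUnicodeString(s: String): UnicodeString = new UnicodeString(s)
  implicit def fromUnicodeString(u: UnicodeString): String = u.underlying
}
```

After import UnicodeSyntax._, an expression such as someString.codePointLength compiles because String has no member of that name, so the compiler falls back on the implicit view.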

So why not _do_ it instead of squabbling about it?

(Note: I said _instead of_ not _as well as_. Code speaks louder than words.)

--Rex

On Wed, Dec 22, 2010 at 8:40 AM, Daniel Sobral <dcsobral@gmail.com> wrote:
> Woah! Wait a second!
>
> There isn't a "Scala's current indexOf method on String", because String is
> not a Scala class, and, naturally, neither is its indexOf method. It's JAVA.
> If you say Java got fixed, then it got fixed for both Java and Scala.
>
> Both Char and String are _Java_.
>
> So I don't understand how can you say that Java is getting fixed but Scala
> is stuck. Would you mind providing an example?
>
> On Tue, Dec 21, 2010 at 22:25, Jim Balter <Jim@balter.name> wrote:
>>
>> P.S. A moral that can be drawn from Jon's example is that, as more and
>> more Java code is converted to handle Strings properly by using the
>> codePoint methods that were added to java.lang.Character in 1.5,
>> interoperability between Scala and Java will break down. Given a
>> *properly* coded Java insert method and Scala's current indexOf method
>> on Strings, or v.v., you will get the wrong result for Strings that
>> contain supplemental characters. Strings are not arrays of fixed-size
>> chars, they are sequences of variable-size characters, and commercial
>> Java programmers now know that and have the tools to program them
>> properly -- but laboriously. But the only tools Scala programmers have
>> are those awful java.lang.Character methods; natural Scala approaches
>> give the wrong results.
>>
>>
>> On Tue, Dec 21, 2010 at 4:08 PM, Jim Balter <Jim@balter.name> wrote:
>> > On Tue, Dec 21, 2010 at 3:22 PM, Jon Pretty <_@scala.propensive.com>
>> > wrote:
>> >> Arya Irani wrote:
>> >>> Hmm, ok, good point. Where did myJavaString come from though?
>> >>
>> >> It could have come from /so many/ different places... ;-)
>> >>
>> >>> Ok, so... you'd need every java.lang.String to be autoboxed into a
>> >>> scala.JimString, and except (or unboxed) where java.lang.String is
>> >>> required. Thoughts?
>> >>
>> >> You could do that. But I'm playing devil's advocate, and I think the
>> >> problem is deeper than
>> >> that. Consider the following:
>> >>
>> >> val js : JimString = "Cafébabe" // let's pretend é takes up two code points
>> >>
>> >> org.random.javaproject.StringTools.insert(js, " ", js.indexOf("babe"))
>> >>
>> >> I would like the insert method to give me the string "Café babe", but
>> >> no matter how correct
>> >> the implementation of indexOf for JimString is, the implicit conversion
>> >> will only convert
>> >> the JimString to a String; it won't convert the 4 to a 5.
>> >
>> > But but but ... 4 is the *correct* value; if
>> > org.random.javaproject.StringTools.insert needs a 5 there to do the
>> > right thing, then it is broken -- it doesn't handle full Unicode. If
>> > that insert method is coded correctly, then it does not do naive
>> > indexing or slicing operations ... it traverses the string one
>> > character at a time and inserts the space after the 4th character --
>> > *not* the 4th 16-bit "char" -- please read the java.lang.Character API
>> > that I posted previously. Keep in mind that java.lang.String itself is
>> > not broken, it's a valid Unicode representation (UTF-16), so
>> > converting another valid Unicode representation, whether js is UTF-8
>> > or UTF-16 or UTF-32, to java.lang.String is not a problem.
>> >
>> >> And ironically, the whole thing would have worked fine if we had never
>> >> invented JimString...
>> >
>> > Well, no, it would not have "worked fine", because being broken for
>> > Unicode supplemental characters is not "fine".
>> >
>> >> I think the moral of the story is that it's too late to fix Java
>> >> Strings.
>> >
>> > It isn't necessary to fix Java Strings -- it's String *handling*, and
>> > the Java char and Scala Char types, that are broken. *Optionally* one
>> > could change the default String type to be UTF-8 or UTF-32, because
>> > UTF-16 is suboptimal for both space and speed (unless one brokenly
>> > treats UTF-16 as an indexable array), but it isn't necessary for
>> > correctness.
>> >
>> >> That's not to say there isn't a need for a better Unicode library, but
>> >> it shouldn't be
>> >> seamless (seams make subtle boundaries clear) or the default,
>> >
>> > The current default is *broken* and results in all Scala code that
>> > operates on Strings as a collection of characters being broken. At the
>> > very least we need a transition path from that broken default to
>> > something that isn't broken.
>> >
>> >> and a prerequisite of using it
>> >> should be a good understanding of the issues surrounding Java Strings.
>> >
>> > Yes, indeed, but that's something that some of the discussants here
>> > clearly lack.
>> >
>
>
>
> --
> Daniel C. Sobral
>
> I travel to the future all the time.
>
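
The index mismatch in the quoted Cafébabe example can be reproduced with a real supplementary character. The sketch below is an assumption about how a code-point-aware insert might be written, not the StringTools.insert from the quote; CodePointInsert and its methods are hypothetical names, while offsetByCodePoints and codePointCount are standard String methods:

```scala
object CodePointInsert {
  /** Inserts `piece` before the code point at index `cpIndex` of `s`. */
  def insertAt(s: String, piece: String, cpIndex: Int): String = {
    // Translate the code-point index into a UTF-16 code-unit offset first.
    val charIndex = s.offsetByCodePoints(0, cpIndex)
    s.substring(0, charIndex) + piece + s.substring(charIndex)
  }

  /** Index of `sub` in `s` counted in code points, or -1 if absent. */
  def indexOfCodePoints(s: String, sub: String): Int = {
    val charIndex = s.indexOf(sub) // counted in UTF-16 code units
    if (charIndex < 0) -1 else s.codePointCount(0, charIndex)
  }
}
```

For s = "Caf\uD835\uDC52babe" (U+1D452 standing in for the two-unit é), s.indexOf("babe") is 5 but indexOfCodePoints(s, "babe") is 4, and insertAt(s, " ", 4) puts the space after the fourth character without splitting the surrogate pair. Mixing the two index conventions across a library boundary is what produces the wrong result.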

Fri, 2010-12-24, 01:27

#32

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 4:09 PM, Rex Kerr wrote:
> On Wed, Dec 22, 2010 at 6:18 PM, Jim Balter wrote:
>>
>> Again, read the java.lang.Character API, with its many codePoint
>> methods. As I said, you can code Java string handling correctly *if*
>> you assiduously avoid the char type and slicing or indexing strings.
>
> So we know what to do.
>
>>
>> And again, as I said, you can code it correctly in Scala if you use
>> those java.lang.Character methods -- but that's even more horrible to
>> do in Scala than in Java, where such imperative code is standard.
>
> This sounds like a case for a library.

Yes, I know.

>>
>> In
>> Scala, you should be able to traverse a Java string as a sequence of
>> 32-bit Unicode characters, even though it is *encoded* as an array of
>> 16-bit UTF-16 code units. And you should be able to naturally
>> (functionally) traverse UTF-8 and UTF-32 encodings as well.
>
> This really sounds like a case for a library.

Yes, I know.

> If you care enough, you
> should write one. If you don't care enough, let people who do care enough
> make such arguments instead of you. They, after having tried to write one,
> will probably understand the issues better, and will make better
> recommendations.
>
>>
>> And
>> character constants should be 32-bit entities, not 16-bit entities
>>
>> (that's broken in Java too, but Java will never change whereas Scala
>> still can).
>
> If Scala changes, Java-Scala interop breaks.

There are changes, and then there are changes. For instance, changing
StringOps to implement the correct abstraction would not break interop
with Java.

> But you can always write a
> unicode-aware layer on top of Java chars and Strings, and use that instead.
> If you're careful, you can make it look to Java-land like whatever the
> consensus solution is in Java-land, and make it extra-pretty in Scala.
> Implicit conversions are your friend.

However, Jon Pretty pointed out that implicit conversions may not
always be invoked when you want them to be -- they only are when the
code doesn't type check. This may put serious restrictions on a
solution.
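
A small sketch of that restriction, with hypothetical names (ImplicitPitfall, CodePointOps). It assumes an enrichment that tries to give indexOf code-point semantics:

```scala
object ImplicitPitfall {
  /** Hypothetical enrichment that tries to redefine indexOf in code points. */
  implicit class CodePointOps(val s: String) {
    def indexOf(sub: String): Int = {
      val i = s.indexOf(sub) // java.lang.String.indexOf, in UTF-16 code units
      if (i < 0) -1 else s.codePointCount(0, i)
    }
  }

  // One supplementary character (U+1D452) followed by "babe".
  val text: String = "\uD835\uDC52babe"

  // String already has indexOf(String), so this type-checks without the
  // implicit view and the Java method runs: the result is in code units.
  val viaString: Int = text.indexOf("babe")

  // Only an explicit wrap reaches the code-point-aware version.
  val viaWrapper: Int = new CodePointOps(text).indexOf("babe")
}
```

viaString is 2 and viaWrapper is 1. Any method that java.lang.String already defines cannot be transparently corrected this way; the enrichment only helps for new method names or for values explicitly typed as the wrapper.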

> So why not _do_ it instead of squabbling about it?

I am doing it.

> (Note: I said _instead of_ not _as well as_. Code speaks louder than
> words.)

That is rather hypocritical of you, who are *only* squabbling. I think
discussing the issues is worthwhile; if you don't want to hear from
me, there are mechanisms available to you to avoid doing so.

Fri, 2010-12-24, 01:47

#33

Donna Malayeri

Joined: 2009-10-21,

Re: [Serious messages]

Agreed, Tony. His post wasn't even amusing, which would have been its only redeeming quality. (And I'm not particularly invested in Haskell, given that I've never written a single line of code in the language.)

Perhaps one could generously attribute a serious message to the long, unfunny rant, and that would probably be: Haskell is scary to many programmers. This is not a problem faced by Haskell alone, by any means, but has been helped along by the long tradition of programming being taught as the mere fact of learning C or C++ and later Java (yes, what a leap forward). The comments at the end of the article, all extolling the wit and virtues of the esteemed pundit, just underscore this fact: sadly, most professional programmers are extremely undereducated when it comes to the matter of programming languages (well, that, and software engineering, too, but that's a different argument entirely.)

Incidentally, afaik Haskell doesn't have as good a story for interop as do Scala or F# (please correct me if I'm wrong), but, say, Perl doesn't have great cross-language interop either! But I don't see much criticism regarding interop in that silly blog post, so if that was his point, he didn't make it very clearly.

At any rate, I make the (bold) claim that the main reason that a particular language is or is not used is a business--rather than a technical--matter. Managers at a big software company are less likely to take the bold risks of startups and smaller companies. I gave a guest lecture on this for an undergraduate software engineering course at CMU; slides are up at http://www.cs.cmu.edu/~donna/public/language-and-SE.pdf (see pdf page 8 onward).

I think there is certainly room for an interesting debate here, though it would be ironic if that blog post were the instigating factor.

Donna

On Fri, Dec 24, 2010 at 12:30 AM, Tony Morris <tonymorris@gmail.com> wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There is no serious message here...

...except for perhaps an examination on why Yegge insists on offering
comment on topics of which he has not even the most fundamental
understanding. Let's be clear; he has not even the most basic
understanding of many of the relevant topics. I could list them and
that man would be completely baffled. In fact, I have done exactly
that before.

It is demonstrable by historical commentary, which has remained
immutable to this day. Further, Yegge gives this all away accidentally
and unknowingly. Please do not take this man seriously. *Please*, for
your sake.

I use Haskell almost all day. I work for a Java product company. There
is a reason I use Haskell all day and it has something to do with
productivity. Recently, myself and a colleague had to write a program
for our Java programmers, so I wrote it in Scala and told them it was
Java bytecode with a really great editor that has a special save
function called scalac. I use many languages when this "Java"
requirement doesn't come up, including Haskell. Think about this. Please.

I didn't intend to hijack this thread with this topic and I really do
not like attacking people, but if you're going to spread a "serious
message", then please do not refer to such catastrophic departure from
reality on behalf of that clueless man. He is not to be taken
seriously, so I will accept your introductory apology, even if out of
context.

I'm more interested in an actual serious message than any foreseeable
debate on the contentious issues that I may have raised. I hope the
risk works out...

On 24/12/10 07:58, Kevin Wright wrote:
>
> I'm truly sorry, but if I didn't do this then someone else would.
>
> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html
>
> There's actually a serious message here. However efficient haskell
> may be, is it really good enough to allow us to rewrite literally
> centuries of cumulative work, overnight, that's already in
> production? If not, then we'll continue to need to work with
> existing systems, and this kind of interop is essential.
>
> By all means, we should evolve the languages and tools that we work
> with, but to just abandon what has gone before is the worst kind of
> disrespect...
>

- --
Tony Morris
http://tmorris.net/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0T250ACgkQmnpgrYe6r62afACgkhG9LMYvXupiSY8VIen5N/7M
sYoAoIT00z2mdPBYIQRjj1NLsTZ2VIwx
=L0F6
-----END PGP SIGNATURE-----

Fri, 2010-12-24, 01:57

#34

Tony Morris 2

Joined: 2009-03-20,

Re: [Serious messages]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 10:43, Donna Malayeri wrote:
> Agreed, Tony. His post wasn't even amusing, which would have been
> its only redeeming quality. (And I'm not particularly invested in
> Haskell, given that I've never written a single line of code in
> the language.)

Thank you for reassuring/reminding me of the existence of independent
thinkers out there :)

>
> Perhaps one could generously attribute a serious message to the
> long, unfunny rant, and that would probably be: Haskell is scary
> to many programmers. This is not a problem faced by Haskell alone,
> by any means, but has been helped along by the long tradition of
> programming being taught as the mere fact of learning C or C++ and
> later Java (yes, what a leap forward). The comments at the end of
> the article, all extolling the wit and virtues of the esteemed
> pundit, just underscore this fact: sadly, most professional
> programmers are extremely undereducated when it comes to the
> matter of programming languages (well, that, and software
> engineering, too, but that's a different argument entirely.)
>
> Incidentally, afaik Haskell doesn't have as good a story for
> interop as do Scala or F# (please correct me if I'm wrong),

You are exactly right.

> but, say, Perl doesn't have great cross-language interop either!
> But I don't see much criticism regarding interop in that silly blog
> post, so if that was his point, he didn't make it very clearly.
>
> At any rate, I make the (bold) claim that the main reason that a
> particular language is or is not used is a business--rather than a
> technical--matter. Managers at a big software company are less
> likely to take the bold risks of startups and smaller companies. I
> gave a guest lecture on this for an undergraduate software
> engineering course at CMU; slides are up at
> http://www.cs.cmu.edu/~donna/public/language-and-SE.pdf (see
> pdf page 8 onward).

My pessimism leads to even stronger conclusions regarding ineptitude.
After all, how on earth is a person who has *absolutely no clue* (and
I mean this in contrast to say, a beginner who has a bit of a clue)
able to so catastrophically mislead (comical intent aside, although
like you, I also didn't find it funny) many others who also have no
clue in such a way? The ineptitude continues. I am appealing for a
stop to this cycle.

>
> I think there is certainly room for an interesting debate here,
> though it would be ironic if that blog post were the instigating
> factor.

Yes, this is my appeal. *Interesting* (or *serious*) debate. These
adjectives are extremely important, at least to me.

Thanks for the response :)

>
> Donna
>
> On Fri, Dec 24, 2010 at 12:30 AM, Tony Morris
> <tonymorris@gmail.com> wrote:
>
>
> There is no serious message here...
>
> ...except for perhaps an examination on why Yegge insists on
> offering comment on topics of which he has not even the most
> fundamental understanding. Let's be clear; he has not even the most
> basic understanding of many of the relevant topics. I could list
> them and that man would be completely baffled. In fact, I have done
> exactly that before.
>
> It is demonstrable by historical commentary, which has remained
> immutable to this day. Further, Yegge gives this all away
> accidentally and unknowingly. Please do not take this man
> seriously. *Please*, for your sake.
>
> I use Haskell almost all day. I work for a Java product company.
> There is a reason I use Haskell all day and it has something to do
> with productivity. Recently, myself and a colleague had to write a
> program for our Java programmers, so I wrote it in Scala and told
> them it was Java bytecode with a really great editor that has a
> special save function called scalac. I use many languages when this
> "Java" requirement doesn't come up, including Haskell. Think about
> this. Please.
>
> I didn't intend to hijack this thread with this topic and I really
> do not like attacking people, but if you're going to spread a
> "serious message", then please do not refer to such catastrophic
> departure from reality on behalf of that clueless man. He is not to
> be taken seriously, so I will accept your introductory apology,
> even if out of context.
>
> I'm more interested in an actual serious message than any
> foreseeable debate on the contentious issues that I may have
> raised. I hope the risk works out...
>
> On 24/12/10 07:58, Kevin Wright wrote:
>
>> I'm truly sorry, but if I didn't do this then someone else
>> would.
>>
>> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html
>>
>> There's actually a serious message here. However efficient
>> haskell may be, is it really good enough to allow us to rewrite
>> literally centuries of cumulative work, overnight, that's already
>> in production? If not, then we'll continue to need to work with
>> existing systems, and this kind of interop is essential.
>>
>> By all means, we should evolve the languages and tools that we
>> work with, but to just abandon what has gone before is the worst
>> kind of disrespect...
>

- --
Tony Morris
http://tmorris.net/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0T7mYACgkQmnpgrYe6r63SOwCgniosi0+dyXw+zS6BG8DvLOE5
lFQAoK35A2/0i9XmxKpgxG+oIrN6DFxv
=Uiyr
-----END PGP SIGNATURE-----

Fri, 2010-12-24, 02:07

#35

Warren Henning

Joined: 2008-12-31,

Re: [Serious messages]

Dude, you totally got trolled.

I thought the post was unfunny and stupid, of course.

On Thu, Dec 23, 2010 at 3:30 PM, Tony Morris wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> There is no serious message here...
>
> ...except for perhaps an examination on why Yegge insists on offering
> comment on topics of which he has not even the most fundamental
> understanding. Let's be clear; he has not even the most basic
> understanding of many of the relevant topics. I could list them and
> that man would be completely baffled. In fact, I have done exactly
> that before.

Fri, 2010-12-24, 02:17

#36

Tony Morris 2

Joined: 2009-03-20,

Re: [Serious messages]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 10:53, Warren Henning wrote:
> Dude, you totally got trolled.
Not exactly. A clueless person uttered words in such a way as to
display the magnificence of that cluelessness in an attempt to be
comical (which imo was also a failure). That person is renowned for
these displays. Alone, I offer no response.

Then, a person took this seriously. The latter is concerning, the
former is not. I am now compelled to respond.

Fri, 2010-12-24, 02:47

#37

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 4:22 PM, Jim Balter wrote:
> On Thu, Dec 23, 2010 at 4:09 PM, Rex Kerr wrote:

>> If Scala changes, Java-Scala interop breaks.
>
> There are changes, and then there are changes. For instance, changing
> StringOps to implement the correct abstraction would not break interop
> with Java.

An example concerning Scala-Java interoperability:

The UTF-8 file "supp" contains the character

Fri, 2010-12-24, 03:07

#38

Erik Engbrecht

Joined: 2008-12-19,

Re: [Serious messages]

> At any rate, I make the (bold) claim that the main reason that a particular language is or is not used is a business--rather than a technical--matter. Managers at a big software company are less likely to take the bold risks of startups and smaller companies.

I think it's more nuanced than that. There are managers and technical leaders who are primarily focused on solving the problems at hand, particularly the hardest problems at hand, as quickly as possible and with a narrow focus; and you have those who are primarily focused on avoiding future problems and tend to heavily consider the "organizational context" of their decisions.
Startups generally only have the former, because they have to focus on the here-and-now in order to survive and they aren't big enough for "organizational context" to be overly relevant beyond "get the product out the door before we can no longer afford to eat."
Large organizations tend to have both types of people, with the latter being more common - or at least more obvious - than the former.
I work for a very large company, and I've been both enthusiastically complimented and accused of risking the very foundations of the company for the exact same work. It just depends on the person's temperament, and it goes well beyond selection of programming languages and other tools.

Fri, 2010-12-24, 03:27

#39

dcsobral

Joined: 2009-04-23,

Re: Support for Ropes in Scala]

Changing the type of a character literal would instantly break any code using them, _plus_ make it hard to interact with anything with a Char interface. And what, exactly, would be gained by that which would not be gained by an implicit conversion?
As for how one sees a Java String, if any Scala method returns an index which cannot be then used with a Java method, it would be broken. Most String methods are Java's, and StringOps only offer a complement. Some of them could be changed -- like map -- but at the cost of having StringOps handle a String in two different ways.
As for what a Java String is, even methods such as codePointAt treat it as a sequence of "char".
As for "that", I meant Char and String. They can't be changed because they are Java's, not Scala's.
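To make the index question concrete, here is a minimal sketch (not anyone's proposed API) of converting between a code point index and the char index that Java's String methods expect. It relies only on java.lang.String.offsetByCodePoints and codePointCount; the object and method names (IndexConversion, charIndex, codePointIndex) are invented for this example.

```scala
object IndexConversion {
  // Converts a code point index into the char (UTF-16 code unit) index
  // that java.lang.String methods such as substring or indexOf use.
  def charIndex(s: String, codePointIndex: Int): Int =
    s.offsetByCodePoints(0, codePointIndex)

  // Converts a char index back into a code point index by counting
  // the code points that precede it.
  def codePointIndex(s: String, charIndex: Int): Int =
    s.codePointCount(0, charIndex)
}
```

With s = "Caf\uD835\uDD38babe" (U+1D538 standing in for the "é that takes two units" from the earlier example), s.indexOf("babe") is 5 in chars, while the code point index is 4; charIndex(s, 4) gives back 5. Any library that reports code point indices would need a conversion like this at the Java boundary.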

On Thu, Dec 23, 2010 at 16:03, Jim Balter <Jim@balter.name> wrote:

P.S.

Nothing you wrote below contradicts what I wrote that you are
responding to, which said nothing about changing what types Char and
String are (although I do think that the default should change) -- I
don't know what your "that" (in "_cannot_ change that") refers to, but
it's not to anything in what I wrote. What I wrote was that the type
of character constants should change (to UTF32Char or the equivalent),
and that you should be able to traverse a Java String (which is
properly seen as a UTF-16 encoding, *not* an array of chars) as a
sequence of 32-bit Unicode characters (UTF32Char or the equivalent).
Such a change would be entirely within the Scala library. As it is,
StringOps provides the wrong abstraction.

On Thu, Dec 23, 2010 at 4:58 AM, Daniel Sobral <dcsobral@gmail.com> wrote:
> Scala _cannot_ change that, for two reasons:
> 1. A Char is an AnyVal, and any replacement would have to be an AnyRef. This
> has consequences all the way down to the turtles.
> 2. Char and String are supposed to be the same in Scala as in Java,
> precisely so there's no "translation layer" between Java and Scala.
> If Odersky had not opted to make Scala interoperate with Java so
> transparently, then it might have been another matter. As it is, if you get
> a Char interface from Java, then you pass a Char to it, and vice versa.
> So, Char will always be a Java's char and String will always be a Java's
> String. Any improvement must be done without changing this. One could have
> an UTF32String and UTF32Char, for example, and all methods that come with
> it, plus implicit conversions to and from String, and from Char.
>
> On Wed, Dec 22, 2010 at 21:18, Jim Balter <Jim@balter.name> wrote:
>>
>> Again, read the java.lang.Character API, with its many codePoint
>> methods. As I said, you can code Java string handling correctly *if*
>> you assiduously avoid the char type and slicing or indexing strings.
>> And again, as I said, you can code it correctly in Scala if you use
>> those java.lang.Character methods -- but that's even more horrible to
>> do in Scala than in Java, where such imperative code is standard. In
>> Scala, you should be able to traverse a Java string as a sequence of
>> 32-bit Unicode characters, even though it is *encoded* as an array of
>> 16-bit UTF-16 code points. And you should be able to naturally
>> (functionally) traverse UTF-8 and UTF-32 encodings as well. And
>> character constants should be 32-bit entities, not 16-bit entities
>> (that's broken in Java too, but Java will never change whereas Scala
>> still can).
>>
>> On Wed, Dec 22, 2010 at 8:40 AM, Daniel Sobral <dcsobral@gmail.com> wrote:
>> > Woah! Wait a second!
>> >
>> > There isn't a "Scala's current indexOf method on String", because String
>> > is
>> > not a Scala class, and, naturally, neither is its indexOf method. It's
>> > JAVA.
>> > If you say Java got fixed, then it got fixed for both Java and Scala.
>> >
>> > Both Char and String are _Java_.
>> >
>> > So I don't understand how can you say that Java is getting fixed but
>> > Scala
>> > is stuck. Would you mind providing an example?
>> >
>> > On Tue, Dec 21, 2010 at 22:25, Jim Balter <Jim@balter.name> wrote:
>> >>
>> >> P.S. A moral that can be drawn from Jon's example is that, as more and
>> >> more Java code is converted to handle Strings properly by using the
>> >> codePoint methods that were added to java.lang.Character in 1.5,
>> >> interoperability between Scala and Java will break down. Given a
>> >> *properly* coded Java insert method and Scala's current indexOf method
>> >> on Strings, or v.v., you will get the wrong result for Strings that
>> >> contain supplemental characters. Strings are not arrays of fixed-size
>> >> chars, they are sequences of variable-size characters, and commercial
>> >> Java programmers now know that and have the tools to program them
>> >> properly -- but laboriously. But the only tools Scala programmers have
>> >> are those awful java.lang.Character methods; natural Scala approaches
>> >> give the wrong results.
>> >>
>> >>
>> >> On Tue, Dec 21, 2010 at 4:08 PM, Jim Balter <Jim@balter.name> wrote:
>> >> > On Tue, Dec 21, 2010 at 3:22 PM, Jon Pretty <_@scala.propensive.com>
>> >> > wrote:
>> >> >> Arya Irani wrote:
>> >> >>> Hmm, ok, good point. Where did myJavaString come from though?
>> >> >>
>> >> >> It could have come from /so many/ different places... ;-)
>> >> >>
>> >> >>> Ok, so... you'd need every java.lang.String to be autoboxed into a
>> >> >>> scala.JimString, and except (or unboxed) where java.lang.String is
>> >> >>> required. Thoughts?
>> >> >>
>> >> >> You could do that. But I'm playing devil's advocate, and I think
>> >> >> the
>> >> >> problem is deeper than
>> >> >> that. Consider the following:
>> >> >>
>> >> >> val js : JimString = "Cafébabe" // let's pretend é takes up two
>> >> >> code points
>> >> >>
>> >> >> org.random.javaproject.StringTools.insert(js, " ",
>> >> >> js.indexOf("babe"))
>> >> >>
>> >> >> I would like the insert method to give me the string "Café babe",
>> >> >> but
>> >> >> no matter how correct
>> >> >> the implementation of indexOf for JimString is, the implicit
>> >> >> conversion
>> >> >> will only convert
>> >> >> the JimString to a String; it won't convert the 4 to a 5.
>> >> >
>> >> > But but but ... 4 is the *correct* value; if
>> >> > org.random.javaproject.StringTools.insert needs a 5 there to do the
>> >> > right thing, then it is broken -- it doesn't handle full Unicode. If
>> >> > that insert method is coded correctly, then it does not do naive
>> >> > indexing or slicing operations ... it traverses the string one
>> >> > character at a time and inserts the space after the 4th character --
>> >> > *not* the 4th 16-bit "char" -- please read the java.lang.Character
>> >> > API
>> >> > that I posted previously. Keep in mind that java.lang.String itself
>> >> > is
>> >> > not broken, it's a valid Unicode representation (UTF-16), so
>> >> > converting another valid Unicode representation, whether js is UTF-8
>> >> > or UTF-16 or UTF-32, to java.lang.String is not a problem.
>> >> >
>> >> >> And ironically, the whole thing would have worked fine if we had
>> >> >> never
>> >> >> invented JimString...
>> >> >
>> >> > Well, no, it would not have "worked fine", because being broken for
>> >> > Unicode supplemental characters is not "fine".
>> >> >
>> >> >> I think the moral of the story is that it's too late to fix Java
>> >> >> Strings.
>> >> >
>> >> > It isn't necessary to fix Java Strings -- it's String *handling*, and
>> >> > the Java char and Scala Char types, that are broken. *Optionally* one
>> >> > could change the default String type to be UTF-8 or UTF-32, because
>> >> > UTF-16 is suboptimal for both space and speed (unless one brokenly
>> >> > treats UTF-16 as an indexable array), but it isn't necessary for
>> >> > correctness.
>> >> >
>> >> >> That's not to say there isn't a need for a better Unicode library,
>> >> >> but
>> >> >> it shouldn't be
>> >> >> seamless (seams make subtle boundaries clear) or the default,
>> >> >
>> >> > The current default is *broken* and results in all Scala code that
>> >> > operates on Strings as a collection of characters being broken. At
>> >> > the
>> >> > very least we need a transition path from that broken default to
>> >> > something that isn't broken.
>> >> >
>> >> >> and a prerequisite of using it
>> >> >> should be a good understanding of the issues surrounding Java
>> >> >> Strings.
>> >> >
>> >> > Yes, indeed, but that's something that some of the discussants here
>> >> > clearly lack.
>> >> >
>> >
>> >
>> >
>> > --
>> > Daniel C. Sobral
>> >
>> > I travel to the future all the time.
>> >
>
>
>
> --
> Daniel C. Sobral
>
> I travel to the future all the time.
>

--
Daniel C. Sobral

I travel to the future all the time.

Fri, 2010-12-24, 03:37

#40

Russ P.

Joined: 2009-01-31,

Re: [Serious messages]

I thought the article was pretty funny -- but that does not mean I share the author's viewpoint on Haskell.

Just for kicks, I read another of the author's articles, his July 2010 article on Java "private/final" modifiers. It was funny too, even though I disagree completely with the premise of the article. I think this guy just enjoys pushing people's buttons.

In the past, I have had online debates on comp.lang.python about "private/protected" modifiers. Python has no such thing, of course. I suggested that perhaps something like them should be added to the language if doing so is technically feasible. Many Python enthusiasts were completely against it and claimed that "private" and "protected" are completely useless -- clients should be "treated as adults." I tried to explain that "private" and "protected" provide useful information to the client (through the compiler) about what is supposed to be in the public interface and what is not. I also pointed out that, if you have the source code (as you do with nearly all Python software), then you essentially have a master key and can remove any "private" or "protected" modifier that is getting in your way. Most of them would hear none of it. They just resent being told what they can use and what they can't. Oh, well.

I don't use Python much any more.

Russ P.

On Thu, Dec 23, 2010 at 5:12 PM, Tony Morris <tonymorris@gmail.com> wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 10:53, Warren Henning wrote:
> Dude, you totally got trolled.
Not exactly. A clueless person uttered words in such a way as to
display the magnificence of that cluelessness in an attempt to be
comical (which imo was also a failure). That person is renowned for
these displays. Alone, I offer no response.

Then, a person took this seriously. The latter is concerning, the
former is not. I am now compelled to respond.

- --
Tony Morris
http://tmorris.net/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0T82QACgkQmnpgrYe6r60x/QCdEwE0qiKp3LkSd8CSfvrc9yHA
tTIAoJnAmnp58cbnSR/8QJkdf0Lnd8vU
=an/K
-----END PGP SIGNATURE-----

--
http://RussP.us

Fri, 2010-12-24, 04:27

#41

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 6:24 PM, Daniel Sobral wrote:
> Changing the type of a character literal would instantly break any code
> using them, _plus_ make it hard to interact with anything with a Char
> interface. And what, exactly, would be gained by that which would not be
> gained by an implicit conversion?
> As for how one sees a Java String, if any Scala method returns an index
> which cannot be then used with a Java method, it would be broken. Most
> String methods are Java's, and StringOps only offer a complement. Some of
> them could be changed -- like map -- but at the cost of having StringOps
> handle a String in two different ways.

I see a number of assertions here but no support, and what seem to me
to be misunderstandings. What would be gained by changing the type of
a character literal is of course that one could then express all
Unicode characters in Scala source (note the io.Exception I mentioned
in my example when trying to enter a supplemental character into the
REPL).

> As for what a Java String is, even methods such as codePointAt treat it as a
> sequence of "char".

No, they certainly don't, and that's a fundamental misunderstanding --
see the example I just posted. The *representation* is a sequence of
"char", of course -- because Java chars are 16-bit entities and Java
strings are encoded in UTF-16. But it's an *encoding* -- characters
are represented by a variable number (1 or 2) of "char"s. We will just
waste our own and everyone else's time and talk past each other if no
distinction is made between the representation (char[] is the
parameter type of codePointAt) and the abstraction -- the nth
codePointAt is a 21-bit codePoint, which is returned as an int, not
the nth char (unless the first n-1 characters all happen to be in the
BMP).

> As for "that", I meant Char and String. They can't be changed because they
> are Java's, not Scala's.

Again, "what I wrote ... said nothing about changing what types Char and
String are". But it isn't true that they "can't" be changed -- String,
for instance, is simply a type declaration in Predef.scala. One could,
for example, provide a scala.FullUnicode module that declares String
and/or Char differently. Whether such a change in the declaration of
the Scala Char and String types should eventually be the default is
something that can be explored. But most of my recent posts have been
in the context of leaving the type of String unchanged and only
changing, e.g., "

Fri, 2010-12-24, 04:47

#42

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 7:20 PM, Jim Balter wrote:
> On Thu, Dec 23, 2010 at 6:24 PM, Daniel Sobral wrote:

>> As for what a Java String is, even methods such as codePointAt treat it as a
>> sequence of "char".
>
> No, they certainly don't, and that's a fundamental misunderstanding --

Sorry, you are correct, Daniel, and I am mistaken. But the Java API
requires meticulous parsing, maintaining an array index that must be
manually bumped by two whenever a surrogate is encountered. As I have
said before, this sort of thing may be acceptable in Java, where
everything is done manually and repetitively, but it does not at all
fit with Scala's functional style. We can certainly do better than
*that*.
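For illustration, the manual traversal described above might look like the following sketch. It uses only java.lang.String.codePointAt and java.lang.Character.charCount; the names CodePointWalk and codePoints are made up for the example and are not part of any library.

```scala
object CodePointWalk {
  // Collects the code points of a UTF-16 encoded String by hand.
  // The index is a char index and must be advanced by 1 or 2,
  // depending on whether the code point needed a surrogate pair.
  def codePoints(s: String): List[Int] = {
    val out = List.newBuilder[Int]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)       // reads 1 or 2 chars starting at i
      out += cp
      i += Character.charCount(cp)    // 2 for supplementary characters, else 1
    }
    out.result()
  }
}
```

For "a\uD835\uDD38b" (a, U+1D538, b) the String length is 4 chars, but the traversal yields three code points. This is the bookkeeping that a more functional Scala API would presumably want to hide.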

Fri, 2010-12-24, 05:57

#43

arya

Joined: 2010-02-11,

Re: Support for Ropes in Scala]

Ok, so...

What are some performance considerations that should be kept in mind when implementing a UTF string library?
Must a UTF8 string be stored as an Array[Byte]? What about Seq[Seq[Byte]] or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a code point?
-Arya

Fri, 2010-12-24, 05:57

#44

Tony Morris 2

Joined: 2009-03-20,

UTF-8 strings

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/12/10 14:50, Arya Irani wrote:
> Ok, so...
>
> What are some performance considerations that should be kept in
> mind when implementing a UTF string library?
>
> Must a UTF8 string be stored as an Array[Byte]? What about
> Seq[Seq[Byte]] or Array[Seq[Byte]] or Array[Array[Byte]], where
> each element represents a code point?
>
> -Arya
How about we move this to scala-debate with an appropriate topic?

After all, the original post asked about ropes in scala (of which I
know of an implementation in scalaz) and a few other things around
this topic and not UTF-8 and strings on the JVM.

Just suggestin'

Fri, 2010-12-24, 07:07

#45

d_m

Joined: 2010-11-11,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 11:50:00PM -0500, Arya Irani wrote:
> What are some performance considerations that should be kept in mind when
> implementing a UTF string library?
>
> Must a UTF8 string be stored as an Array[Byte]? What about Seq[Seq[Byte]]
> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
> code point?

Just to be clear: I think Jim wants a Unicode string library rather
than just a UTF-8 string library. Unicode is a specification which
assigns numbers to glyphs, whereas UTF-8 is a particular method of
storing strings of Unicode glyphs as bytes.

You can use Seq[Int] (which corresponds to the UTF-32 encoding) to
correctly represent all existing Unicode glyphs at the cost of
increased memory usage. For instance, "cat" takes 12 bytes when
represented as Array[Int] (UTF-32) but 3 bytes when represented as
Array[Byte] (UTF-8).

UTF-8 uses a variable number of bits to reduce memory usage (in essence
a simple form of compression), but this complicates code which wants to
handle the string in terms of glyphs (for instance the number of bytes
in a UTF-8 string will often differ from the number of Unicode glyphs).
With the Seq[Int] representation the length of the sequence is the same
as the number of glyphs.

I think the ideal would be to have a library which can deal with (at
least) two representations of a Unicode string: UTF-32 (Seq[Int]) and
UTF-8 (Seq[Byte]). One could use the former for simplicity and speed,
and the latter when trying to conserve memory and for I/O.

None of this is profound, but I thought it would be useful to make the
distinction between Unicode and UTF-8 (the two are often conflated).
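The size trade-off can be checked with a small sketch. It assumes UTF-32 is stored as one 4-byte Int per code point, as described above; the object and method names (EncodingSizes, utf8Bytes, utf32Bytes, codePoints) are hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object EncodingSizes {
  // Number of bytes in the UTF-8 encoding of s.
  def utf8Bytes(s: String): Int = s.getBytes(UTF_8).length

  // Number of code points in s (not the number of 16-bit chars).
  def codePoints(s: String): Int = s.codePointCount(0, s.length)

  // Number of bytes if each code point is stored as a 4-byte Int.
  def utf32Bytes(s: String): Int = 4 * codePoints(s)
}
```

For "cat" this gives 3 bytes in UTF-8 and 12 in UTF-32. For "caf\u00E9" the UTF-8 size is 5 bytes for 4 code points, which shows the byte count and the character count drifting apart as soon as non-ASCII text appears.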

Fri, 2010-12-24, 07:27

#46

ichoran

Joined: 2009-08-14,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 7:22 PM, Jim Balter <Jim@balter.name> wrote:

On Thu, Dec 23, 2010 at 4:09 PM, Rex Kerr <ichoran@gmail.com> wrote:
> If Scala changes, Java-Scala interop breaks.

There are changes, and then there are changes. For instance, changing
StringOps to implement the correct abstraction would not break interop
with Java.

Sure it would, because StringOps look like methods on String, and if you changed StringOps to implement a different abstraction, String wouldn't even look self-consistent. So you're basically forced to use something other than String. Maybe a UTF-8 rope?

> But you can always write a
> unicode-aware layer on top of Java chars and Strings, and use that instead.
> If you're careful, you can make it look to Java-land like whatever the
> consensus solution is in Java-land, and make it extra-pretty in Scala.
> Implicit conversions are your friend.

However, Jon Pretty pointed out that implicit conversions may not
always be invoked when you want them to be -- they only are when the
code doesn't type check. This may put serious restrictions on a
solution.

You _might_ have to do something as drastic as define a .u method on String to translate to a Unicode-safe class (n.b. Regex and .r). Horrors!
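As a rough sketch of what such an explicit conversion could look like (modelled loosely on how .r yields a Regex), assuming a made-up wrapper class UString that stores one Int per code point. Nothing here is an existing Scala API; UnicodeSyntax, UString and StringToU are hypothetical names.

```scala
object UnicodeSyntax {
  // Hypothetical Unicode-safe view of a String: one Int per code point.
  final class UString(val codePoints: Vector[Int]) {
    def length: Int = codePoints.length       // counts code points, not chars
    def apply(i: Int): Int = codePoints(i)    // i-th code point
    override def toString: String = {
      val sb = new java.lang.StringBuilder
      codePoints.foreach(cp => sb.appendCodePoint(cp))
      sb.toString
    }
  }

  // Adds an explicit .u method to String; the conversion only happens
  // when the caller asks for it.
  implicit class StringToU(s: String) {
    def u: UString = {
      val b = Vector.newBuilder[Int]
      var i = 0
      while (i < s.length) {
        val cp = s.codePointAt(i)
        b += cp
        i += Character.charCount(cp)
      }
      new UString(b.result())
    }
  }
}
```

After import UnicodeSyntax._, "a\uD835\uDD38b".u.length is 3 even though the underlying String has 4 chars. Because the seam is explicit, a plain String passed to Java code is left untouched.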

> So why not _do_ it instead of squabbling about it?

I am doing it.

> (Note: I said _instead of_ not _as well as_. Code speaks louder than
> words.)

That is rather hypocritical of you, who are *only* squabbling. I think
discussing the issues is worthwhile; if you don't want to hear from
me, there are mechanisms available to you to avoid doing so.

I'm happy to hear a discussion of the issues. But sniping is pretty pointless, and you're not even doing it very well. For example, hypocrisy is the pretense of having virtues that one does not, especially as evidenced by applying different standards to oneself and others. But the standard I am applying to you is, "If you dislike something that is fixable, fix it instead of complaining at length." This standard _does not even apply_ to me since what I dislike is probably not fixable, and to the extent that it is, complaining seems a reasonable way to attempt to accomplish it. So your characterization is simply wrong; that is not what the word "hypocritical" means. (This isn't the first time you've made questionable or incorrect statements of this sort.) Even if you manage to correctly employ negative adjectives, you'll likely only manage to irritate the very people who might be useful at giving advice or help in fixing the problem.

So far, I haven't seen any particularly interesting details about the problems with Unicode support. Okay, fine, if you stuff a value of 100k into 16 bits, it'll wrap. Not particularly surprising. Write an iterator that reads UTF-16 and returns code points, and write a routine to convert a collection of code points into a string, and wrap the code point methods of java.lang.Character for more convenient use. And _then_ show why the features available in Scala don't make this an entirely workable solution. Or at least describe the problems in a lot more technical detail. Otherwise you're not really "discussing the issues"; you're just arguing. (Arguing has its place too, but this is getting rather old.)
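A minimal sketch of the two routines suggested here, assuming nothing beyond the java.lang code point methods (String.codePointAt, Character.charCount, StringBuilder.appendCodePoint). The names CodePointOps, codePoints and fromCodePoints are invented for illustration.

```scala
object CodePointOps {
  // Iterator that reads a UTF-16 encoded String and returns code points.
  def codePoints(s: String): Iterator[Int] = new Iterator[Int] {
    private var i = 0
    def hasNext: Boolean = i < s.length
    def next(): Int = {
      val cp = s.codePointAt(i)
      i += Character.charCount(cp)   // skip both halves of a surrogate pair
      cp
    }
  }

  // Converts a collection of code points back into a String.
  def fromCodePoints(cps: Seq[Int]): String = {
    val sb = new java.lang.StringBuilder
    cps.foreach(cp => sb.appendCodePoint(cp))
    sb.toString
  }
}
```

With these, ordinary collection operations work at the code point level: fromCodePoints(codePoints(s).toList.reverse) reverses "a\uD835\uDD38b" into "b\uD835\uDD38a" without splitting the surrogate pair. Whether this is enough, or where it falls short, is the open question in this exchange.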

--Rex

Fri, 2010-12-24, 07:47

#47

Jim McBeath

Joined: 2009-01-02,

Re: Support for Ropes in Scala]

On Thu, Dec 23, 2010 at 04:22:27PM -0800, Jim Balter wrote:
> I am doing it.

In the spirit of "release early and often", would you be willing to
share with us the technical details that you have so far? An API
description would allow me to gain a much better understanding of what
you are proposing, and I would enjoy a dialog focused on designing a
good solution.

--
Jim

Fri, 2010-12-24, 07:57

#48

jibal

Joined: 2010-12-01,

Re: Support for Ropes in Scala]

See the comments in scala-debate, where I will put my future posts on
this subject.

On Thu, Dec 23, 2010 at 10:37 PM, Jim McBeath wrote:
> On Thu, Dec 23, 2010 at 04:22:27PM -0800, Jim Balter wrote:
>> I am doing it.
>
> In the spirit of "release early and often", would you be willing to
> share with us the technical details that you have so far? An API
> description would allow me to gain a much better understanding of what
> you are proposing, and I would enjoy a dialog focused on designing a
> good solution.
>
> --
> Jim
>

Fri, 2010-12-24, 11:37

#49

Ben Hutchison 3

Joined: 2009-11-02,

Re: Support for Ropes in Scala]

On Fri, Dec 24, 2010 at 8:58 AM, Kevin Wright wrote:
> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-dis...
>
> There's actually a serious message here.

Ok, I'll bite. It's a funny piece. Steve Yegge has some real talent as
a comedy writer. He's collated a bunch of the classic stereotypes
about programming communities and cultures into a cleverly executed,
amusing parody.

However, that piece does nothing to advance a reader's understanding or
appreciation of Haskell, or Functional Programming, or Static Typing,
or the influence that these have had, and are having, on how we build
software. It quite purposely perpetuates tired old cliches that
mislead and obscure.

Great entertainment value. But don't be fooled that it says something
significant about how we should do programming.

-Ben

Fri, 2010-12-24, 12:07

#50

Kevin Wright 2

Joined: 2010-05-30,

Re: Support for Ropes in Scala]

Just to clarify... My point was serious, not the article's (as if it even had a point to make)
I just wanted to lighten the mood a bit first, not to claim that the article was anything other than light-hearted humour.

On 24 December 2010 10:34, Ben Hutchison <brhutchison@gmail.com> wrote:

On Fri, Dec 24, 2010 at 8:58 AM, Kevin Wright <kev.lee.wright@gmail.com> wrote:
> http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html
>
> There's actually a serious message here.

Ok, I'll bite. It's a funny piece. Steve Yegge has some real talent as
a comedy writer. He's collated a bunch of the classic stereotypes
about programming communities and cultures into a cleverly executed,
amusing parody.

However, that piece does nothing to advance a reader's understanding or
appreciation of Haskell, or Functional Programming, or Static Typing,
or the influence that these have had, and are having, on how we build
software. It quite purposely perpetuates tired old cliches that
mislead and obscure.

Great entertainment value. But don't be fooled that it says something
significant about how we should do programming.

-Ben

--
Kevin Wright

gtalk / msn : kev.lee.wright@gmail.com
mail: kevin.wright@scalatechnology.com
vibe / skype: kev.lee.wright
twitter: @thecoda
