
Detecting unicode

sadie
Joined: 2008-12-21,

I have a number of files to read that come in different character encodings -
mostly UCS-2, for some bizarre reason. Scala's file methods, like Java's,
use the system default encoding unless you manually specify another. But
since neither my code nor the system can know in advance which encoding each
file uses, I need to detect the character encoding of a given file.

Thankfully they do all have the correct BOM, the two or three bytes at the
start of the file indicating which variant of unicode they belong to, so in
theory it should be a straightforward job to detect the correct encoding for
a given file. Straightforward but laborious - especially if you want to
detect the full set - and there's a chance I'd get it wrong.

Is there some code out there or in the standard libraries that already
does this, or do I have to roll my own?
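
For reference, the BOM check described above could be sketched in Scala roughly as follows. This is a minimal sketch, not an existing library API: the `BomSniffer` object and its `detect` method are hypothetical names, and it covers only the UTF-8, UTF-16 and UTF-32 marks (UCS-2 files carry the same two-byte marks as UTF-16, while the UTF-32 marks are four bytes long).

```scala
object BomSniffer {
  // Known byte order marks, longest first, so that the UTF-32LE mark
  // (FF FE 00 00) is tested before the UTF-16LE mark (FF FE).
  // Note: FF FE 00 00 could also be UTF-16LE text starting with U+0000;
  // this sketch assumes that case does not occur.
  private val marks: Seq[(Array[Byte], String)] = Seq(
    Array(0x00, 0x00, 0xFE, 0xFF) -> "UTF-32BE",
    Array(0xFF, 0xFE, 0x00, 0x00) -> "UTF-32LE",
    Array(0xEF, 0xBB, 0xBF)       -> "UTF-8",
    Array(0xFE, 0xFF)             -> "UTF-16BE",
    Array(0xFF, 0xFE)             -> "UTF-16LE"
  ).map { case (ints, name) => ints.map(_.toByte) -> name }

  /** Returns the charset name and the BOM length in bytes,
    * or None if `head` does not start with a known BOM. */
  def detect(head: Array[Byte]): Option[(String, Int)] =
    marks.collectFirst {
      case (bom, name) if head.startsWith(bom) => (name, bom.length)
    }
}
```

To use it on a file, read the first four bytes (fewer if the file is shorter), pass them to `detect`, and skip the reported number of bytes before decoding the rest with the named charset.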

extempore
Joined: 2008-12-17,
Re: Detecting unicode

On Sun, Feb 01, 2009 at 10:55:41AM -0800, Marcus Downing wrote:
> Is there some code out there or in the standard libraries that already
> does this, or do I have to roll my own?

It's not scala, but since I just recently had to learn more than I cared to about charsets, this is the most
promising code I found for that purpose.

http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

sadie
Joined: 2008-12-21,
Re: Detecting unicode

Paul Phillips wrote:
>
> It's not scala, but since I just recently had to learn more than I cared
> to about charsets, this is the most
> promising code I found for that purpose.
>
> http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
>
That looked promising - it returned the correct encoding, and I could see
that the data was in its cache - but something was stopping BufferedReader
from working. I could probably have fixed it, but I found this instead and
got it working:

http://koti.mbnet.fi/akini/java/unicodereader/

Scala doesn't seem to like accessing Java classes with an empty package, so
I had to stick it under org.unicode and make the constructor public.

Naftoli Gugenheim
Joined: 2008-12-17,
Re: Detecting unicode
While the topic is being discussed, a while ago I made a little quiz script. The New Jersey Motor Vehicles Commission has online (Ajax) quizzes to prepare you to take the permit test. I downloaded the XML files that it uses behind the scenes and made a little console script to test you on it. I ran into a minor problem. Apparently the XML files start with a BOM, and Scala was reading it as gibberish -- invalid XML. I ended up opening each file in Notepad++ and converting it from with BOM to without BOM.

Derek Chen-Becker
Joined: 2008-12-16,
Re: Detecting unicode

Naftoli Gugenheim wrote:
> I ran into a minor problem. Apparently the XML files start with a BOM,
> and Scala was reading it as gibberish -- invalid XML. I ended up opening
> each file in Notepad++ and converting it from with BOM to without BOM.

I've seen the same issue. I had thought that the BOM would be properly
processed but it was coming through as garbage, so I ended up hacking by
explicitly skipping the BOM bytes in my input. I'd love for it to be
handled correctly, but I haven't had time to really dig into the code.

Derek
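
One variation on the skipping workaround, assuming the input has already been decoded with the charset that matches its BOM (for example a UTF-8 file read as UTF-8, where Java leaves the mark in place as the character U+FEFF), is to drop that leading character before handing the text to an XML parser. The `BomStripper` name below is hypothetical, and this is only a sketch of the idea, not the code used by anyone in this thread.

```scala
object BomStripper {
  /** Removes a single leading U+FEFF (the decoded byte order mark), if present. */
  def stripBom(text: String): String =
    if (text.nonEmpty && text.charAt(0) == '\uFEFF') text.substring(1)
    else text
}
```

This does not help when the file was decoded with the wrong charset, because the mark then shows up as other characters (the "gibberish" mentioned above) rather than as U+FEFF.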

Alex Cruise
Joined: 2008-12-17,
Re: Detecting unicode

Derek Chen-Becker wrote:
> I've seen the same issue. I had thought that the BOM would be properly
> processed but it was coming through as garbage, so I ended up hacking by
> explicitly skipping the BOM bytes in my input. I'd love for it to be
> handled correctly, but I haven't had time to really dig into the code.
>
Seeing as scala.io.Source already presents a character-oriented view of
I/O, I think BOM sniffing would be a useful enhancement there. IIRC a
lot of XML parsers do it.

-0xe1a
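
As an illustration of what BOM sniffing in front of `scala.io.Source` might look like, here is a rough sketch. `BomAwareSource.open` is a hypothetical helper, not part of the Scala library; it only recognises the UTF-8 and UTF-16 marks (a UTF-32LE file would be mistaken for UTF-16LE), and it falls back to a caller-supplied charset when no mark is found.

```scala
import java.io.{InputStream, PushbackInputStream}
import java.nio.charset.{Charset, StandardCharsets}
import scala.io.{Codec, Source}

object BomAwareSource {
  /** Opens `in` as a Source, choosing the charset from a leading BOM if any. */
  def open(in: InputStream, fallback: Charset = StandardCharsets.UTF_8): Source = {
    val pb = new PushbackInputStream(in, 3)
    val head = new Array[Byte](3)

    // Read up to three bytes; the stream may be shorter than that.
    var n = 0
    var r = 0
    while (n < 3 && { r = pb.read(head, n, 3 - n); r != -1 }) n += r

    val b = head.map(_ & 0xFF) // unsigned view of the bytes read

    // Pick the charset and the number of BOM bytes to drop.
    val (charset, bomLength) =
      if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) (StandardCharsets.UTF_8, 3)
      else if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) (StandardCharsets.UTF_16BE, 2)
      else if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) (StandardCharsets.UTF_16LE, 2)
      else (fallback, 0)

    // Push back everything that was read beyond the BOM.
    pb.unread(head, bomLength, n - bomLength)
    Source.fromInputStream(pb)(Codec(charset))
  }
}
```

A caller would then write something like `BomAwareSource.open(new java.io.FileInputStream(path)).getLines()`, where `path` is whatever file is being read.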

Jon Pretty
Joined: 2009-02-02,
Re: Detecting unicode

Hi Marcus,

Thanks for introducing me to BOMs - I'd somehow never been aware of them
before!

Marcus Downing wrote:
> Scala doesn't seem to like accessing Java classes with an empty package, so
> I had to stick it under org.unicode and make the constructor public.

I've not tried this, but does importing _root_.JavaClass work?

Jon
