Re: A million reasons why Encoding was a mistake

From: Brian Candler <lists@...>
Date: 2012-05-22 07:37:56 UTC
List: ruby-talk #395791

Austin Ziegler wrote in post #1061436:
> This is *not* a Ruby problem, this is a *data* problem.

Leaving aside the point that not all data is text, you still need a 
clear conceptual model to be able to reason about your program.

In Python 3, there is a clear distinction between "characters" and "a 
sequence of bytes which encode those characters". They are two 
completely different classes and cannot be combined (e.g. a+b will 
always fail if a is str and b is bytes). It's also symmetrical: you 
convert from bytes to characters as text enters your program, and from 
characters to bytes as text leaves it.

(Aside: I know that Python only supports Unicode characters, but this is 
just an implementation limitation. There could be a third class 
"gb2312str" if desired, and additional classes for other character sets 
which are not subsets of Unicode)

Ruby muddles these concepts by having all strings be a sequence of bytes 
plus the encoding, which in turn muddles the concepts of "character set" 
and "a method of encoding that character set".
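A minimal sketch of the "bytes plus an encoding tag" model, assuming a current Ruby where a `\u` escape yields a UTF-8 string. The variable names are illustrative only. It contrasts relabelling the same bytes (`force_encoding`) with actually transcoding them (`encode`):

```ruby
s = "\u00e9"                                      # one character (e-acute), stored as UTF-8
utf8_bytes = s.bytes                              # the two bytes that encode it in UTF-8

# force_encoding changes only the tag; the bytes stay exactly the same,
# so the string now "contains" two Latin-1 characters.
relabelled = s.dup.force_encoding("ISO-8859-1")

# encode changes the bytes so that the same character is represented
# in the target encoding.
transcoded = s.encode("ISO-8859-1")

puts "original:   #{utf8_bytes.inspect} length=#{s.length} #{s.encoding}"
puts "relabelled: #{relabelled.bytes.inspect} length=#{relabelled.length} #{relabelled.encoding}"
puts "transcoded: #{transcoded.bytes.inspect} length=#{transcoded.length} #{transcoded.encoding}"
```

The same object type is used for all three, which is the muddle described above: whether a String is "characters" or "bytes" depends on the tag it happens to carry.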

Now, you could argue that Ruby is actually implementing the Python 3 
approach but in a "lazy" way: by not explicitly converting bytes to 
characters until required, it avoids potentially unnecessary work. But 
if so, it's half-baked. For example, you cannot combine a UTF-16LE 
string with a UTF-16BE string, even though they are the same character 
set (Unicode). What's worse is that a UTF-16LE string will sort 
differently to a UTF-16BE string (because ruby 1.9 sorts by byte 
ordering, which happens to work for UTF-8 but not all other encodings of 
Unicode). So it kind-of behaves like a string of characters, except that 
it doesn't.
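Both points can be sketched in a few lines. This is an illustration rather than a definitive test: the sample strings are arbitrary, and it assumes a Ruby built with the usual UTF-16LE and UTF-16BE encodings.

```ruby
le = "ab".encode("UTF-16LE")
be = "ab".encode("UTF-16BE")

# Same characters, same character set, but concatenation is refused.
concat_error = begin
  le + be
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Sorting compares bytes, so the order depends on the byte order of the encoding.
words = ["a", "\u0100"]   # U+0061 and U+0100
le_order   = words.map { |w| w.encode("UTF-16LE") }.sort.map { |w| w.encode("UTF-8") }
be_order   = words.map { |w| w.encode("UTF-16BE") }.sort.map { |w| w.encode("UTF-8") }
utf8_order = words.sort

puts "le + be raised: #{concat_error.inspect}"
puts "UTF-16LE order: #{le_order.inspect}"
puts "UTF-16BE order: #{be_order.inspect}"
puts "UTF-8 order:    #{utf8_order.inspect}"
```

In UTF-16LE the low byte comes first, so U+0100 (bytes 00 01) sorts before "a" (bytes 61 00). The big-endian and UTF-8 forms give the opposite order.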

Furthermore, ruby sometimes lets you combine objects representing 
"characters" and "bytes", or "characters with encoding A" and 
"characters with encoding B". Whether it is allowed or not depends on 
the run-time contents of those objects.

If a = b + c *always* crashed when b and c had different encodings, I 
would really not have a problem with any of this. Your test case would 
immediately catch it, you fix it, problem solved.

However ruby 1.9's insidious behaviour means that b + c may *or may not* 
crash depending not only on the encodings but also on the actual content of the 
strings at that instant. One perfectly reasonable set of tests may pass; 
actual application data may fail.
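A small sketch of that content dependence, assuming Ruby 2.0 or later for `String#b`. The strings are made up for illustration. Both right-hand operands carry the same encoding tag (ASCII-8BIT); only their contents differ:

```ruby
utf8        = "caf\u00e9"     # UTF-8, contains a non-ASCII character
ascii_bytes = "abc".b         # ASCII-8BIT, but every byte is 7-bit
high_bytes  = "\xFF\xFE".b    # ASCII-8BIT, with bytes above 127

# Works: the binary string happens to contain only ASCII bytes.
ok = utf8 + ascii_bytes

# Fails: same two encodings as above, different data.
failure = begin
  utf8 + high_bytes
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

puts "utf8 + ascii_bytes -> #{ok.inspect} (#{ok.encoding})"
puts "utf8 + high_bytes  -> raised #{failure.inspect}"
```

A test suite that only ever feeds ASCII data through the binary side would pass the first case and never see the second.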

Finally, ruby is asymmetrical. On input, encodings are tagged; on 
output, they are ignored (by default). From files, the environment 
encoding is used; from sockets, the ASCII_8BIT encoding is used. With 
regexps, invalid strings cause an exception; with String#[] they do not. 
It is just an utter dog's breakfast of arbitrary rules which you just 
have no choice but to learn.
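The regexp versus String#[] difference can be sketched as follows. This reflects the behaviour as I understand it on the Ruby versions I know of; the sample string is arbitrary:

```ruby
# A string tagged UTF-8 whose last byte is not valid UTF-8.
bad = "abc\xFF".dup.force_encoding("UTF-8")

# Indexing quietly hands back the invalid byte as a one-"character" string.
indexed = bad[3]

# Matching a regexp against the same string raises instead.
regexp_error = begin
  bad =~ /c/
  nil
rescue ArgumentError => e
  e.class
end

puts "valid_encoding?  #{bad.valid_encoding?}"
puts "bad[3].bytes     #{indexed.bytes.inspect}"
puts "regexp raised    #{regexp_error.inspect}"
```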

Some people see ruby 1.9's highly complex encoding implementation as a 
triumph of engineering; I see it as design smell.

> Matz and others have worked very hard to make sure that Ruby 1.9 works
> well if you follow certain rules regarding your inputs and outputs.

... which one has to absorb by osmosis. Certainly the core API docs 
don't give these rules; in fact they give precious little about the 
encoding semantics of String. And you can't get much more of a core part 
of the language than String.

Want to find out what String#[] does when given a string which contains 
invalid characters in its declared encoding? The docs won't help you. 
Try it and see. Or go to the C source code.

Of course, because every String is now two-dimensional (x = sequence of 
bytes, y = Encoding) there is a much higher requirement to document 
every method which acts on a string or returns a string, because 
there is a much larger variety of inputs and outputs to consider.

Take strings with invalid characters, for example, or the fact that 
every returned string also has an encoding and you need to document how 
it is chosen. (For example Net::HTTP: does it return strings with 
encoding from the Content-Type header? You tell me)

Incidentally, strings with invalid characters are not an edge case or 
only for erroneous input. Ruby encourages you to do things like:

    txt = sock.read(4096)    # txt likely to contain a split character at the end

This could be dealt with if bytes were explicitly converted to characters at 
some point (you'd buffer the extra bit). Without this explicit 
conversion, you are quite likely to have byte patterns which don't 
represent *any* character. Yes, you can do the buffering yourself; I'm 
just saying that all methods need to *document* whether they do accept 
strings with invalid bytes, and how they handle them.
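For what "doing the buffering yourself" might look like, here is a rough sketch. It assumes UTF-8 input, uses a StringIO in place of a real socket, and reads in tiny chunks to force splits. `take_complete_utf8` is a hypothetical helper, not part of any library:

```ruby
require "stringio"

# Hypothetical helper: split a binary buffer into the longest prefix that is
# valid UTF-8 and the trailing bytes (at most 3) of an unfinished character.
def take_complete_utf8(buffer)
  0.upto([3, buffer.bytesize].min) do |cut|
    keep = buffer.bytesize - cut
    head = buffer.byteslice(0, keep).force_encoding(Encoding::UTF_8)
    return [head, buffer.byteslice(keep, cut)] if head.valid_encoding?
  end
  raise ArgumentError, "input is not valid UTF-8"
end

original = "h\u00e9llo w\u00f6rld"
sock = StringIO.new(original.b)    # stand-in for a socket delivering raw bytes

first_chunk = sock.read(2)         # ends in the middle of a two-byte character
first_chunk_valid = first_chunk.dup.force_encoding("UTF-8").valid_encoding?
sock.rewind

text    = +""                      # decoded characters so far (UTF-8)
pending = "".b                     # undecoded bytes carried between reads
while (chunk = sock.read(2))
  pending << chunk
  decoded, pending = take_complete_utf8(pending)
  text << decoded
end

puts "first chunk valid on its own? #{first_chunk_valid}"
puts "reassembled: #{text}"
```

This only defers a short trailing fragment; anything that is invalid for other reasons raises, which is one possible policy among several.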

> If you don't respect your encodings, they will bite you. They may not
> bite you up front (as they do with Ruby, because it exposes these
> things which are painful), but they *will* bite you.

Certainly you need to know about character sets and how they are 
encoded. This does not imply that ruby does it in a sane way. And as I 
said before, if Ruby were to bite you consistently, it would be much 
better.

> Ruby got it right, because it acknowledges that (a) this is hard and
> (b) gives you the tools you need in order to make this less painful.
> It also doesn't (c) incorrectly assume that everything is or can be
> expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
> and back for some characters.)

That's kind of irrelevant, since ruby 1.9 doesn't really handle 
Shift-JIS either, except to transcode it.

-- 
Posted via http://www.ruby-forum.com/.
