ruby-core

Hello Dave,

Here's some feedback. I have to admit that I find it difficult
to comment on only some aspects of a document, so you get a bit
of everything, from exposition to typos. Sorry.

General:

I think you spend too many pages on explaining encodings
for program source, and maybe not enough for encoding of data.
The latter, as you say halfway into the current text, is more
important, but by that time, you may have lost the reader
who thinks that she's not going to use non-ASCII identifiers
anyway. I'd try to start with external data, then go to data
in source files, and only use non-ASCII identifiers at the
end (you already have a bit of that).

I think you should also explain where the boundaries of the
current approach are. The main point is that String-related
operations break down if the strings (or regexps,...) have
truly different encodings (there is quite a bit of machinery
to try and make sure that things still work if you e.g.
have a string labeled UTF-8 and a string labeled SHIFT_JIS,
but both only contain ASCII). What the programmer has to
do in such a case is to make sure the objects are correctly
labeled, and to transcode if necessary (String#encode or
Iconv or Kconv can be used for the latter). I'd at least
include one example that produces an error, and one or two
examples with a fix, to help readers who sooner or later
will bump into this. It's also possible to bump into this
with identifiers (or symbols in general), because they
have to match both in terms of bytes and in terms of
encoding label.
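
Something along these lines is what I have in mind. It is only a
sketch: the helper names concat_error and safe_concat are made up
for illustration, and the Shift_JIS bytes are simply the two
characters U+65E5 U+672C.

```ruby
# Hypothetical helper: returns the class of the error raised by a + b,
# or nil if the concatenation succeeds.
def concat_error(a, b)
  a + b
  nil
rescue EncodingError => e
  e.class
end

# Hypothetical fix: transcode b into a's encoding before concatenating.
def safe_concat(a, b)
  a + b.encode(a.encoding)
end

utf8 = "caf\u00E9"                                         # labeled UTF-8
sjis = "\x93\xFA\x96\x7B".dup.force_encoding("Shift_JIS")  # labeled Shift_JIS

# Both strings contain non-ASCII data with different labels: this fails.
p concat_error(utf8, sjis)   # Encoding::CompatibilityError

# Different labels, but ASCII-only content: this still works.
p concat_error("abc", "def".dup.force_encoding("Shift_JIS"))   # nil

# After transcoding, both operands are UTF-8 and the result is UTF-8.
p safe_concat(utf8, sjis).encoding
```

The fix assumes the labels are correct. If an object is mislabeled,
it has to be relabeled first (String#force_encoding), because
transcoding from the wrong label only produces garbage or an error.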

In your text, you always use SJIS, whereas the canonical
name is SHIFT_JIS. I suggest you use the latter more,
and you mention that some encodings have aliases, and
that all encoding names can be written in any case
you want.
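
A short illustration of both points, as a sketch only. The helper
same_encoding? is a made-up name. Which alias maps to which
encoding can differ between Ruby versions, so the actual table in
Encoding.aliases should be checked rather than assumed.

```ruby
# Hypothetical helper: do two names resolve to the same Encoding object?
def same_encoding?(name_a, name_b)
  Encoding.find(name_a).equal?(Encoding.find(name_b))
end

# Names are looked up case-insensitively.
p same_encoding?("shift_jis", "SHIFT_JIS")   # true
p same_encoding?("utf-8", "UTF-8")           # true

# BINARY is an alias: it resolves to the same object as ASCII_8BIT.
p Encoding.find("binary").equal?(Encoding::ASCII_8BIT)   # true

# The alias table of the running interpreter (alias name => canonical name).
Encoding.aliases.first(5).each { |a, name| puts "#{a} -> #{name}" }
```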

For the association between objects and encodings, you
often use 'encoded'. In many if not all instances in the text,
this suggests that Ruby is actually doing some encoding,
where Ruby simply *labels* the object (string,...).
Using the word 'label' instead of 'encode' will help
get the right message across.
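
The difference can be shown in a few lines. This is a sketch with
made-up helper names (relabel, transcode): relabeling leaves the
bytes alone, transcoding rewrites them.

```ruby
# Hypothetical helper: attach a different label, bytes untouched.
def relabel(str, enc)
  str.dup.force_encoding(enc)
end

# Hypothetical helper: convert the bytes to another encoding.
def transcode(str, enc)
  str.encode(enc)
end

original = "\u00E9"   # e-acute, labeled UTF-8, bytes C3 A9

labeled = relabel(original, "ISO-8859-1")
p labeled.bytes    # [195, 169]  same bytes, now read as two characters
p labeled.length   # 2

converted = transcode(original, "ISO-8859-1")
p converted.bytes  # [233]       different bytes, still one character
p converted.length # 1
```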

p. 1: the PDF writing library: case typo

p. 2 is a bit lengthy. I'd be much more direct in saying
that if your source file contains anything but strict US-ASCII,
label it in the first (or second) line.

p.2, bottom, you say that ASCII-8BIT treats bytes 128 to 255
as additional letters. This is confusing, because it suggests
that e.g.
    "a\xCCbc" =~ /^\w*$/
would return true, which isn't the case.
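
To make this concrete, here is a small sketch. The helper
word_only? is a made-up name, and String#b is used to get a copy
labeled ASCII-8BIT.

```ruby
# Hypothetical helper: true if the whole string consists of \w characters.
def word_only?(str)
  !(str =~ /^\w*$/).nil?
end

binary = "a\xCCbc".b   # copy labeled ASCII-8BIT

# Byte 0xCC is just a byte here, not a letter, so \w does not match it.
p word_only?(binary)     # false
p word_only?("abc".b)    # true
```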

You never mention the escape syntax that can be used to
add bytes (or Unicode characters) in strings. I think
at least a pointer would be helpful.
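
For instance (a sketch; hex_bytes is a made-up helper, and the
source file is assumed to be UTF-8):

```ruby
# Hypothetical helper: show a string's bytes in hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

by_code_point = "\u00E9"     # Unicode escape, exactly four hex digits
by_braces     = "\u{E9}"     # Unicode escape, variable number of digits
by_bytes      = "\xC3\xA9"   # byte escapes, one \x per byte

puts hex_bytes(by_code_point)   # C3 A9
puts hex_bytes(by_braces)       # C3 A9
puts hex_bytes(by_bytes)        # C3 A9

# \u escapes always produce a string labeled UTF-8.
p by_code_point.encoding
```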

For the cat/dog example, you use a non-ASCII d-like character.
It's usually helpful to mention what character exactly it is.

p. 5: "Notice how each file has independent encoding.": Missing article?
The wording I always use in such a case is "encoding is a property
of the data, not of the overall program".

p. 5: Example 12 may stop working in the near future, because
on the Japanese list, it was pointed out that the IANA charset
registry has 'ascii' as an alias of 'us-ascii', and therefore
Ruby should do this too, but currently it's an alias of ASCII-8BIT.
ASCII-8BIT is, more or less as you describe it,
    lower bytes -> ASCII
    upper bytes -> some data, seen as bytes
US-ASCII, on the other hand, is
    lower bytes -> ASCII
    upper bytes -> disallowed
In that thread, Matz pointed out that probably nobody will notice.
I then said that I probably wouldn't have noticed, but now that I
know, I feel responsible to request a change (or change it myself)
because I'm also an expert reviewer for that registry.
[I also said that I don't think that everything in that registry
should be followed slavishly, but that unnecessary and arbitrary
differences should be avoided.]
I haven't yet heard back from Matz or others on that point, but
I'm inclined to go ahead and change it, and that would affect
example 12, and maybe other parts of your text.
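
The difference between the two encodings is easy to see with
String#valid_encoding?. A sketch, with valid_as? being a made-up
helper name:

```ruby
# Hypothetical helper: is this byte sequence valid under the given label?
def valid_as?(str, enc)
  str.dup.force_encoding(enc).valid_encoding?
end

# ASCII-8BIT: upper bytes are allowed, seen as plain bytes.
p valid_as?("a\xCCbc", "ASCII-8BIT")   # true

# US-ASCII: upper bytes are disallowed.
p valid_as?("a\xCCbc", "US-ASCII")     # false

# Pure ASCII data is valid under both labels.
p valid_as?("abc", "US-ASCII")         # true
p valid_as?("abc", "ASCII-8BIT")       # true
```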

p. 6: e-acute glyph -> e-acute character
(the whole thing is about encoding characters with bytes, we don't
have to confuse the reader here with glyphs)

example 16: I think it would help a lot if you clearly said here
that changing the external encoding of a file doesn't change any
data (neither externally nor internally), only how the data is
labeled once it gets read in.
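
A sketch of what I mean, assuming Encoding.default_internal is left
at its default of nil, so that no transcoding happens on read. The
helper read_labeled and the temp file name are made up.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temp file, then read them
# back with the given external encoding.
def read_labeled(raw_bytes, ext_enc)
  Tempfile.create("enc-demo") do |f|
    f.binmode
    f.write(raw_bytes)
    f.flush
    File.read(f.path, encoding: ext_enc)
  end
end

raw = "caf\xC3\xA9".b   # five bytes on disk

as_utf8   = read_labeled(raw, "UTF-8")
as_latin1 = read_labeled(raw, "ISO-8859-1")

# The data is identical in both cases ...
p as_utf8.bytes == as_latin1.bytes   # true

# ... only the label differs, and with it the interpretation.
p as_utf8.encoding     # UTF-8
p as_latin1.encoding   # ISO-8859-1
p as_utf8.length       # 4
p as_latin1.length     # 5
```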

p. 9: My guess would be that the -E option only applies to otherwise
unlabeled source files, but I haven't checked that. If that's the
case, you should say so.

Hope this helps.    Regards,    Martin.

At 13:08 08/01/10, Dave Thomas wrote:
>Folks:
>
>I've put up a rough draft of the description of how Ruby 1.9 encoding  
>works at
>
>http://media.pragprog.com/titles/ruby3/encoding.pdf
>
>I'd appreciate any feedback on the concepts: do I capture the intent  
>behind the way encoding works, and is it an accurate description of  
>how it's used? What I'm looking to do is to cover the way it should be  
>used in practice, rather than explore each and every dark corner.
>
>Cheers
>
>
>Dave
>

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
