[ruby-core:107545] [Ruby master Feature#18579] Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents

From: duerst <noreply@...>
Date: 2022-02-10 07:25:26 UTC
List: ruby-core #107545
Issue #18579 has been updated by duerst (Martin Dürst).


tenderlovemaking (Aaron Patterson) wrote in #note-3:
> naruse (Yui NARUSE) wrote in #note-2:
> > The encoding of the resulted string depends "ascii only or not" and "ascii compatibility".
> > The principle of the resulted encoding is
> > * if LHS is ascii only
> >   * if RHS is ascii only
> >      * LHS's encoding
> >   * else if RHS is ascii compatible
> >      * RHS's encoding
> >   * else (RHS is ascii incompatible)
> >      * exception.
> > * else if LHS is ascii compatible
> >   * if RHS is ascii only
> >      * LHS's encoding
> >   * else if RHS is ascii compatible
> >      * if LHS's encoding equals to RHS's encoding
> >          * LHS's encoding
> >      * else
> >          * exception
> >   * else (RHS is ascii incompatible)
> >      * exception.
> > * else if LHS is ascii compatible
> >     * if LHS's encoding equals to RHS's encoding
> >       * LHS's encoding
> >     * else
> >       * exception
> 
> Is there anything we can do to simplify these rules?  It's way too complicated to remember.

First, it should be "* else if LHS is ascii incompatible" in the third bullet on the top level.
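With that correction applied, the quoted rules can be sketched in Ruby roughly as follows. This is only an illustration: `predicted_encoding` is a hypothetical helper name, it assumes both strings are non-empty, and it returns `nil` where the rules say "exception". The loop compares it against `Encoding.compatible?` for a few sample pairs.

```ruby
# Hypothetical helper (not part of Ruby): mirrors the quoted rules, with the
# third top-level branch read as "LHS is ASCII-incompatible".
# Returns the predicted Encoding, or nil where the rules say "exception".
# Assumption: both strings are non-empty.
def predicted_encoding(lhs, rhs)
  le = lhs.encoding
  re = rhs.encoding

  if lhs.ascii_only?
    if rhs.ascii_only?
      le
    elsif re.ascii_compatible?
      re
    end
  elsif le.ascii_compatible?
    if rhs.ascii_only?
      le
    elsif re.ascii_compatible?
      le == re ? le : nil
    end
  else
    # LHS is ASCII-incompatible (for example UTF-16LE)
    le == re ? le : nil
  end
end

utf8_ascii = "bar".encode(Encoding::UTF_8)
utf8_wide  = "\u307b\u3052"            # two hiragana; \u escapes always yield UTF-8
bin_ascii  = "foo".b
bin_high   = "bad\xFE\xFFstr".b
utf16      = "a".encode("UTF-16LE")

pairs = [
  [utf8_ascii, bin_ascii],
  [utf8_ascii, bin_high],
  [utf8_wide,  bin_ascii],
  [utf8_wide,  bin_high],
  [bin_high,   utf8_wide],
  [utf8_ascii, utf16],
  [utf16,      utf16],
  [utf16,      utf8_ascii],
]

pairs.each do |lhs, rhs|
  predicted = predicted_encoding(lhs, rhs)
  actual    = Encoding.compatible?(lhs, rhs)
  puts "#{lhs.encoding} + #{rhs.encoding}: predicted=#{predicted.inspect} actual=#{actual.inspect}"
end
```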

Second, it's not really that difficult to understand what happens. An encoding defines a mapping from a sequence of bytes to a sequence of characters. Concatenation joins the byte sequences and the character sequences at the same time, provided it can easily know that this is possible (*). If not, an error is raised.

(*) The reason for writing "if it can easily know" is because there are cases such as the following:
```
"résumé".encode('iso-8859-1') + "résumé".encode('iso-8859-2')
```
Because "é" is encoded with the same byte in both iso-8859-1 and iso-8859-2, this concatenation would in theory work with the above rules, but it's too much to teach the implementation about which 8-bit bytes,... code for the same character in which encoding.
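A small sketch of that case, for illustration only: the two strings end up with identical bytes, yet the concatenation is still rejected because the encodings differ and neither side is ASCII-only. The variable names are made up for this example.

```ruby
# \u escapes always produce a UTF-8 literal, regardless of source encoding.
word   = "r\u00e9sum\u00e9"
latin1 = word.encode("ISO-8859-1")
latin2 = word.encode("ISO-8859-2")

# U+00E9 maps to the single byte 0xE9 in both encodings.
same_bytes = latin1.bytes == latin2.bytes

# Concatenation compares encodings and ASCII-only status, not individual bytes.
error = begin
  latin1 + latin2
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts "same bytes: #{same_bytes}"
puts "error class: #{error.class}"
```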

----------------------------------------
Feature #18579: Concatenation of ASCII-8BIT strings shouldn't behave differently depending on string contents
https://bugs.ruby-lang.org/issues/18579#change-96457

* Author: tenderlovemaking (Aaron Patterson)
* Status: Rejected
* Priority: Normal
----------------------------------------
Currently strings tagged with ASCII-8BIT will behave differently when concatenating depending on the string contents.

When concatenating strings the resulting string has the encoding of the LHS.  For example:

```
z = a + b
```

`z` will have the encoding of `a` (if the encodings are compatible).


However `ASCII-8BIT` behaves differently.  If `b` has "ASCII-8BIT" encoding, then the encoding of `z` will sometimes be the encoding of `a`, sometimes it will be the encoding of `b`, and sometimes it will be an exception.

Here is an example program:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar",                     "foo".encode("US-ASCII")    # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"),  "foo".b                     # Return value encoding is LHS, US-ASCII
concat "ほげ",                    "foo".b                     # Return value encoding is LHS, UTF-8
concat "bar",                     "bad\376\377str".b          # Return value encoding is RHS, ASCII-8BIT.  Why?
concat "ほげ",                    "bad\376\377str".b          # Exception
```

This behavior is too hard to understand.  Usually we think LHS encoding will win, or there will be an exception. Even worse is that string concatenation can "infect" strings.  For example:


```ruby
def concat a, b
  str = a + b
  str
end

str = concat "bar", "bad\376\377str".b # this worked
p str
str = concat "ほげ", str               # exception
p str
```

The first concatenation succeeded, but the second one failed.  As a developer it is difficult to find where the "bad string" was introduced.  In the above example, the string may have been read from the network, but by the time an exception is raised it is far from where the "bad string" originated.  In the above example, the bad data came from line 6, but the exception was raised on line 8.
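To make the "infection" visible, here is a minimal sketch that records the encoding after each step. The variable names are only for illustration, and the strings are built with explicit encodings so the result does not depend on the source file's encoding.

```ruby
utf8_ascii = "bar".encode(Encoding::UTF_8)
utf8_wide  = "\u307b\u3052"            # two hiragana, always UTF-8
binary     = "bad\xFE\xFFstr".b        # ASCII-8BIT with non-ASCII bytes

# Step 1 succeeds, but the result silently takes the RHS encoding.
step1          = utf8_ascii + binary
step1_encoding = step1.encoding

# Step 2 fails, far away from where the binary data entered.
step2_error = begin
  utf8_wide + step1
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts "after step 1: #{step1_encoding}"
puts "step 2 raised: #{step2_error.class}"
```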

I propose that ASCII-8BIT strings raise an exception if they cannot be converted into the LHS encoding.  So the above program would become like this:

```ruby
def concat a, b
  str = a + b
  str
end

concat "bar",                     "foo".encode("US-ASCII")    # Return value encoding is LHS, UTF-8
concat "bar".encode("US-ASCII"),  "foo".b                     # Return value encoding is LHS, US-ASCII
concat "ほげ",                    "foo".b                     # Return value encoding is LHS, UTF-8
concat "bar",                     "bad\376\377str".b          # Exception <--- NEW!!
concat "ほげ",                    "bad\376\377str".b          # Exception
```
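The proposed behavior can be approximated in user code today. The sketch below is not how the change would be implemented in core; `strict_concat` is a hypothetical helper, and it assumes "cannot be converted" means "the ASCII-8BIT RHS contains non-ASCII bytes and the LHS is not ASCII-8BIT".

```ruby
# Hypothetical helper (not part of Ruby): raises at the point where
# non-ASCII binary data would be appended to a non-binary string.
def strict_concat(lhs, rhs)
  if rhs.encoding == Encoding::BINARY &&
     lhs.encoding != Encoding::BINARY &&
     !rhs.ascii_only?
    raise Encoding::CompatibilityError,
          "cannot append non-ASCII #{rhs.encoding} data to a #{lhs.encoding} string"
  end
  lhs + rhs
end

utf8_ascii = "bar".encode(Encoding::UTF_8)
utf8_wide  = "\u307b\u3052"            # two hiragana, always UTF-8
bin_high   = "bad\xFE\xFFstr".b

ok = strict_concat(utf8_wide, "foo".b)   # ASCII-only binary data is accepted
puts ok.encoding

begin
  strict_concat(utf8_ascii, bin_high)    # fails here, at the origin
rescue Encoding::CompatibilityError => e
  puts "rejected: #{e.message}"
end
```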


I'm open to other solutions, but the underlying issue is that concatenating an ASCII-8BIT string with a non-ASCII-8BIT string is usually a bug and by the time an exception is raised, it is very far from the origin of the string.


