[ruby-core:102462] [Ruby master Bug#17594] Sort order of UTF-16LE is based on binary representation instead of codepoints

From: daniel@...42.com
Date: 2021-02-11 15:56:48 UTC
List: ruby-core #102462
Issue #17594 has been updated by Dan0042 (Daniel DeLorme).


Technically speaking I'm not sure this is a bug per se. The code correctly implements the spec of sorting by byte, but we might consider this a bug in the spec, or a feature request?

### summary for dev meeting

For most encodings, the byte ordering is the same as the codepoint ordering, except for these:
UTF-16LE: 256 < 255
UTF-32LE: 256 < 255
Windows-31J: 33088 < 223
Shift_JIS: 33088 < 223
MacJapanese: 33088 < 223
SJIS-DoCoMo: 33088 < 223
SJIS-KDDI: 33088 < 223
SJIS-SoftBank: 33088 < 223
UTF-16BE: 65536 < 65535
UTF-16: 65536 < 65535
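
A minimal sketch of how a row of this list can be checked, assuming the codepoints are transcoded from UTF-8. The helper name `byte_order_inverted?` is hypothetical and is not part of the original report; it compares raw byte arrays, so it does not depend on how `String#<=>` behaves in a given Ruby version.

```ruby
# Hypothetical helper (not from the original message): returns true when the
# encoded bytes of the higher codepoint sort before those of the lower one.
def byte_order_inverted?(encoding, low, high)
  low_bytes  = [low].pack('U').encode(encoding).bytes
  high_bytes = [high].pack('U').encode(encoding).bytes
  (high_bytes <=> low_bytes) == -1
end

# Rows taken from the UTF part of the list above, plus UTF-8 as a control.
rows = [
  ['UTF-16LE', 255, 256],
  ['UTF-32LE', 255, 256],
  ['UTF-16BE', 65535, 65536],
  ['UTF-8', 255, 256]
]

rows.each do |encoding, low, high|
  inverted = byte_order_inverted?(encoding, low, high)
  puts "#{encoding}: #{high} sorts before #{low} by bytes? #{inverted}"
end
```

UTF-8 is included only to show an encoding where byte order and codepoint order agree for the same pair.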

For the UTF family of encodings it would be more consistent if "a" < "ā" always, regardless of encoding. The UTF encodings are self-synchronizing, so there's no performance cost to this change. It's simple to search for the first byte difference and then find the codepoint at that location.
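
As a rough illustration of the ordering being proposed, the sketch below compares strings by their codepoint arrays. This is a naive stand-in, not the optimized approach described above (it decodes the whole string instead of only looking at the first differing position), and `codepoint_cmp` is a hypothetical name.

```ruby
# Hypothetical helper: compare two strings of the same encoding by codepoint.
# Array#<=> compares element by element, then by length.
def codepoint_cmp(str1, str2)
  str1.codepoints <=> str2.codepoints
end

# "a" (U+0061) versus "a with macron" (U+0101), written as an escape so the
# example does not depend on the source file encoding.
%w[UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE].each do |encoding|
  a     = 'a'.encode(encoding)
  a_bar = "\u0101".encode(encoding)
  puts "#{encoding}: #{codepoint_cmp(a, a_bar)}"
end
```

With this comparison the result is the same for every encoding in the loop, which is the consistency the paragraph above asks for.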

The SJIS family of encodings should not be changed because
1) they are not self-synchronizing, so sorting by codepoint requires parsing the entire string;
2) the current byte ordering results in あ < ア < ｱ, which is the same as Unicode. Codepoint sorting would result in ｱ < あ < ア; this incompatibility is not desirable.
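
The second point can be reproduced with a small sketch using hiragana a (U+3042), full-width katakana a (U+30A2), and half-width katakana a (U+FF71). The variable names are illustrative only, and the byte ordering is computed from byte arrays rather than from `String#<=>`.

```ruby
# Hiragana a, full-width katakana a, half-width katakana a, written as
# Unicode escapes so the example does not depend on the source file encoding.
KANA = ["\u3042", "\u30A2", "\uFF71"].freeze

sjis = KANA.map { |char| char.encode('Shift_JIS') }

# Order by raw bytes (what sorting does today, per this thread).
by_bytes = sjis.sort_by(&:bytes)

# Order by the Shift_JIS codepoint value that String#ord reports.
by_codepoint = sjis.sort_by(&:ord)

to_utf8 = ->(list) { list.map { |char| char.encode('UTF-8') } }

puts "bytes:     #{to_utf8.call(by_bytes).join(' < ')}"
puts "codepoint: #{to_utf8.call(by_codepoint).join(' < ')}"
puts "unicode:   #{KANA.sort_by(&:ord).join(' < ')}"
```

The half-width katakana is a single byte in Shift_JIS, so it has the smallest codepoint value but the largest leading byte, which is why the two orderings disagree.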

For these encodings, `str1.casecmp(str2)` scans each codepoint in the two strings and
a) if both codepoints are ASCII, lowercases and compares them
b) otherwise, throws away the codepoints and compares by byte
so it would be simpler and more efficient to just compare by codepoint (and, for SJIS, to use byte comparison only once two different codepoints are found).
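
A minimal Ruby sketch of the "just compare by codepoint" idea for the UTF encodings, assuming ASCII-only case folding. The real `String#casecmp` is implemented in C; the names `ascii_fold` and `casecmp_by_codepoint` are hypothetical, and the SJIS fallback to byte comparison is deliberately left out.

```ruby
# Hypothetical helper: fold ASCII uppercase letters to lowercase and leave
# every other codepoint unchanged.
def ascii_fold(codepoint)
  codepoint.between?(0x41, 0x5A) ? codepoint + 0x20 : codepoint
end

# Hypothetical helper: case-insensitive (ASCII only) comparison by codepoint.
# Returns -1, 0, or 1, like String#casecmp does for comparable strings.
def casecmp_by_codepoint(str1, str2)
  folded1 = str1.codepoints.map { |cp| ascii_fold(cp) }
  folded2 = str2.codepoints.map { |cp| ascii_fold(cp) }
  folded1 <=> folded2
end

le = 'UTF-16LE'
puts casecmp_by_codepoint('ABC'.encode(le), 'abc'.encode(le))
puts casecmp_by_codepoint('a'.encode(le), "\u0101".encode(le))
```

Because both steps work on codepoints, the result for UTF-16LE matches what the same call gives for UTF-16BE or UTF-8.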


----------------------------------------
Bug #17594: Sort order of UTF-16LE is based on binary representation instead of codepoints
https://bugs.ruby-lang.org/issues/17594#change-90350

* Author: Dan0042 (Daniel DeLorme)
* Status: Open
* Priority: Normal
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
I just discovered that string sorting is always based on bytes, so the order of UTF-16LE strings will give some peculiar results:

```ruby
BE, LE = 'UTF-16BE', 'UTF-16LE'
str = [*0..0x4ff].pack('U*').scan(/\p{Ll}/).join

puts str.encode(BE).chars.sort.first(50).join.encode('UTF-8')
#abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõ

puts str.encode(LE).chars.sort.first(50).join.encode('UTF-8')
#āȁăȃąȅćȇĉȉċȋčȍďȏđȑēȓĕȕėȗęșěțĝȝğȟġȡģȣĥȥħȧĩȩīȫĭȭįȯаı

'a'.encode(BE) < 'ā'.encode(BE) #=> true
'a'.encode(LE) < 'ā'.encode(LE) #=> false
```

Is this supposed to be correct? I mean, I somewhat understand the idea of just sorting by bytes, but I find the above output to be remarkably nonsensical.

A similar/related issue was found and fixed in #8653, so there's precedent for considering codepoints instead of bytes.


The reason I'm asking is because I was working on some optimizations for `String#casecmp` (https://github.com/ruby/ruby/pull/4133) which, as a side-effect, sort by codepoint for UTF-16LE. And that resulted in a different order for `<=>` vs `casecmp`, and thus some tests broke. But I think sorting by codepoint would be better in this case.



-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>
