[#102393] [Ruby master Feature#17608] Compact and sum in one step — sawadatsuyoshi@...

Issue #17608 has been reported by sawa (Tsuyoshi Sawada).

13 messages 2021/02/04

[#102438] [Ruby master Bug#17619] if false foo=42; end creates a foo local variable set to nil — pkmuldoon@...

Issue #17619 has been reported by pkmuldoon (Phil Muldoon).

10 messages 2021/02/10

[#102631] [Ruby master Feature#17660] Expose information about which basic methods have been redefined — tenderlove@...

Issue #17660 has been reported by tenderlovemaking (Aaron Patterson).

9 messages 2021/02/27

[#102639] [Ruby master Misc#17662] The heredoc pattern used in tests does not syntax highlight correctly in many editors — eregontp@...

Issue #17662 has been reported by Eregon (Benoit Daloze).

13 messages 2021/02/27

[#102652] [Ruby master Bug#17664] Behavior of sockets changed in Ruby 3.0 to non-blocking — ciconia@...

Issue #17664 has been reported by ciconia (Sharon Rosner).

23 messages 2021/02/28

[ruby-core:102346] [Ruby master Bug#17594] Sort order of UTF-16LE is based on binary representation instead of codepoints

From: duerst@...
Date: 2021-02-01 09:33:39 UTC
List: ruby-core #102346
Issue #17594 has been updated by duerst (Martin Dürst).


Real high-quality string sorting would need language information, because sort order differs by language (e.g. German sorts ä with a, but Swedish sorts it after z). Binary sorting may work for ASCII, but even there, it doesn't consider case equivalence. Implementing a high-quality sort is a major project.
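A small illustration (not from the original message) of the ASCII case-equivalence point: byte-wise comparison puts every uppercase letter (0x41–0x5A) before every lowercase letter (0x61–0x7A), and a `sort_by` key is one common workaround:

```ruby
words = %w[apple Zebra banana]

# Byte-wise (default) sort: 'Z' (0x5A) < 'a' (0x61), so "Zebra" comes first.
p words.sort                             # => ["Zebra", "apple", "banana"]

# Case-insensitive ordering via an explicit sort key, with the
# original string as a tie-breaker for stable, deterministic output.
p words.sort_by { |w| [w.downcase, w] }  # => ["apple", "banana", "Zebra"]
```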

UTF-8 is Ruby's main encoding. UTF-16BE and UTF-16LE are 'tolerated', so to speak. Because of the nature of UTF-16LE, the result when sorting by bytes is indeed quite nonsensical; your example only scratches the surface. But changing it to sort by code unit (i.e. two bytes at a time) would just introduce a special case for not really much actual gain.
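To make the "nonsensical" part concrete, here is a sketch (not in the original message) of how UTF-16LE's low-byte-first layout inverts the codepoint order for the 'a'/'ā' pair from the report:

```ruby
a  = 'a'.encode('UTF-16LE')   # U+0061 -> code unit 0x0061
aa = 'ā'.encode('UTF-16LE')   # U+0101 -> code unit 0x0101

p a.bytes    # => [97, 0]  little-endian stores the low byte first
p aa.bytes   # => [1, 1]

# String#<=> compares the raw bytes: 0x61 vs 0x01 decides immediately,
# so 'a' sorts *after* 'ā' despite having the smaller codepoint.
p(a <=> aa)  # => 1
```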

----------------------------------------
Bug #17594: Sort order of UTF-16LE is based on binary representation instead of codepoints
https://bugs.ruby-lang.org/issues/17594#change-90206

* Author: Dan0042 (Daniel DeLorme)
* Status: Open
* Priority: Normal
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
I just discovered that string sorting is always based on bytes, so the order of UTF-16LE strings will give some peculiar results:

```ruby
BE, LE = 'UTF-16BE', 'UTF-16LE'
str = [*0..0x4ff].pack('U*').scan(/\p{Ll}/).join

puts str.encode(BE).chars.sort.first(50).join.encode('UTF-8')
#abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõ

puts str.encode(LE).chars.sort.first(50).join.encode('UTF-8')
#āȁăȃąȅćȇĉȉċȋčȍďȏđȑēȓĕȕėȗęșěțĝȝğȟġȡģȣĥȥħȧĩȩīȫĭȭįȯаı

'a'.encode(BE) < 'ā'.encode(BE) #=> true
'a'.encode(LE) < 'ā'.encode(LE) #=> false
```

Is this supposed to be correct? I mean, I somewhat understand the idea of just sorting by bytes, but I find the above output to be remarkably nonsensical.

A similar/related issue was found and fixed in #8653, so there's precedent for considering codepoints instead of bytes.


The reason I'm asking is because I was working on some optimizations for `String#casecmp` (https://github.com/ruby/ruby/pull/4133) which, as a side-effect, sort by codepoint for UTF-16LE. And that resulted in a different order for `<=>` vs `casecmp`, and thus some tests broke. But I think sorting by codepoint would be better in this case.
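As an editorial aside (not part of the report): until the default ordering changes, the codepoint-based order the reporter wants can be emulated with an explicit `sort_by` key, since `String#codepoints` is byte-order independent:

```ruby
LE = 'UTF-16LE'
strs = ['ā', 'a', 'b'].map { |s| s.encode(LE) }

# Built-in sort compares bytes: 'ā' (0x01 0x01) precedes 'a' (0x61 0x00).
p strs.sort.map { |s| s.encode('UTF-8') }                 # => ["ā", "a", "b"]

# Sorting by the codepoint arrays restores the expected order.
p strs.sort_by(&:codepoints).map { |s| s.encode('UTF-8') }  # => ["a", "b", "ā"]
```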



-- 
https://bugs.ruby-lang.org/

