[ruby-core:93402] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
From: eregontp@...
Date: 2019-06-28 09:26:16 UTC
List: ruby-core #93402
Issue #15940 has been updated by Eregon (Benoit Daloze).
duerst (Martin Dürst) wrote:
> If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.
> So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops to work, or if something that worked one way suddenly works another way.
It's not raising an error:
```
$ ruby -ve 's=:abc.to_s; s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abc\xE9"
#<Encoding:ASCII-8BIT>
$ ruby -ve 's=:abc.to_s.force_encoding("UTF-8"); s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abcé"
#<Encoding:UTF-8>
```
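The two commands above differ only in the receiver's encoding. As a minimal sketch, the same `String#<<` behaviour can be checked from a script. It builds the strings with `force_encoding` instead of `Symbol#to_s`, so it does not depend on which encoding a given Ruby assigns to symbols; the variable names are illustrative only.

```ruby
# String#<< with an Integer appends that codepoint in the receiver's encoding.
ascii = "abc".dup.force_encoding(Encoding::US_ASCII)
ascii << 233   # above 127: the byte is appended and the string becomes binary

utf8 = "abc".dup.force_encoding(Encoding::UTF_8)
utf8 << 233    # appended as U+00E9, which UTF-8 encodes as two bytes

p [ascii.encoding, ascii.bytes]
p [utf8.encoding, utf8.bytes]
```

Neither call raises; the difference is the resulting encoding and byte sequence.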
I'm a bit concerned about compatibility. I think we should evaluate this with a few gems, and check how much of test-all and the specs fail with this change.
I agree that, in general, having a consistent encoding for Symbol literals seems semantically simpler.
TruffleRuby reuses the underlying memory (byte[], aka char*) for interned Strings of different encodings, so only the metadata (encoding, coderange, etc.) is duplicated, but not the actual bytes. MRI could probably do the same, which would be transparent and would not require changing semantics.
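To make the byte-sharing idea concrete, here is a toy model in plain Ruby. It is only a sketch of the concept: `SharedBytesTable` and `Entry` are hypothetical names, and nothing here reflects how TruffleRuby or MRI actually implement their tables.

```ruby
# Toy model only: SharedBytesTable and Entry are hypothetical names,
# not TruffleRuby or MRI internals.
class SharedBytesTable
  Entry = Struct.new(:buffer, :encoding)

  def initialize
    @buffers = {}  # raw bytes => the single frozen buffer holding them
    @entries = {}  # [buffer, encoding] => Entry (the per-encoding metadata)
  end

  def intern(str)
    raw = str.b.freeze                # binary copy of the bytes
    buffer = (@buffers[raw] ||= raw)  # reuse an existing buffer if one matches
    @entries[[buffer, str.encoding]] ||= Entry.new(buffer, str.encoding)
  end
end

table = SharedBytesTable.new
ascii = table.intern("name".dup.force_encoding(Encoding::US_ASCII))
utf8  = table.intern("name".dup.force_encoding(Encoding::UTF_8))

p ascii.equal?(utf8)                # are the two entries the same object?
p ascii.buffer.equal?(utf8.buffer)  # do they point at the same buffer?
```

In this model, each encoding gets its own small entry while the byte buffer is stored once.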
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78944
* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242
It's not uncommon for symbols to have literal string counterparts, e.g.
```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```
Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.
Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows reusing the equivalent string literals.
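The fstring registry itself is not directly visible from Ruby, but the effect can be approximated with `String#-@`, which returns a deduplicated frozen string. The sketch below assumes Ruby 2.5 or later (where `String#-@` deduplicates) and uses it only as an observable stand-in for the registry; the variable names are illustrative.

```ruby
utf8       = "name".dup.force_encoding(Encoding::UTF_8)
other_utf8 = "name".dup.force_encoding(Encoding::UTF_8)
ascii      = "name".dup.force_encoding(Encoding::US_ASCII)

# Same content as far as String#== is concerned...
same_content = (utf8 == ascii)

# ...but interning keeps the encoding, so each encoding yields its own object.
interned_utf8       = -utf8
interned_utf8_again = -other_utf8
interned_ascii      = -ascii

p same_content
p interned_utf8.equal?(interned_utf8_again)
p interned_utf8.equal?(interned_ascii)
p [interned_utf8.encoding, interned_ascii.encoding]
```

Two UTF-8 copies collapse into one interned object, while the US-ASCII copy stays separate even though the bytes are identical.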
The only notable behavioral change is `Symbol#to_s`.
Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
However, several ruby specs assert this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453
If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:
```ruby
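def to_s
  str = fstr.dup # fstr: the symbol's internal (now UTF-8) fstring
  # Keep returning US-ASCII for ASCII-only symbols, as before the patch
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end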
```
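Since a symbol's internal fstring is not reachable from Ruby code, the pseudocode above cannot run as written. A runnable approximation, with the fstring passed in explicitly, might look like the following; `to_s_preserving_us_ascii` is a hypothetical helper name used only for this sketch.

```ruby
# Hypothetical helper standing in for the pseudocode's `to_s`; `fstr` is an
# ordinary String argument here, not the real internal fstring.
def to_s_preserving_us_ascii(fstr)
  str = fstr.dup
  # Encoding::US_ASCII is the same object as Encoding::ASCII
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_name = to_s_preserving_us_ascii("name".dup.force_encoding(Encoding::UTF_8))
other_name = to_s_preserving_us_ascii("r\u00E9sum\u00E9")

p ascii_name.encoding
p other_name.encoding
```

ASCII-only input comes back tagged US-ASCII, matching the pre-patch `Symbol#to_s` result, while anything else keeps its UTF-8 encoding.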
--
https://bugs.ruby-lang.org/