[ruby-core:93413] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
From:
ruby@...
Date:
2019-06-28 16:33:51 UTC
List:
ruby-core #93413
Issue #15940 has been updated by nirvdrum (Kevin Menard).
I generally like the idea, but really from a semantics perspective rather than a memory savings one. It's confusing to both implementers and end users alike that Symbols take on a different encoding from Strings if they happen to be ASCII-only. Another nice benefit of the change is that `String#{intern,to_sym}` can be made much cheaper. Having said all of that, I'm sure the current behavior was maintained for a reason when non-ASCII-only Symbols were introduced. I think it'd be good to look back and see what the rationale was.
If the solution then is to convert the String's encoding when calling `Symbol#to_s` on an ASCII-only Symbol, then I think you're going to need to investigate knock-on effects. E.g., `String#force_encoding` currently unconditionally clears the String's code range. That's metadata you really don't want to lose. But, by setting the encoding to `US-ASCII`, you may be okay most of the time because there are code paths that just check if the encoding uses single-byte characters without doing a full code range scan. Likewise, if you do decide to skip the `US-ASCII` conversion, you could have the inverse problem. Now you have a UTF-8 string and if that doesn't have its code range set, you've turned some O(1) operations into O(n). Please note, I haven't really traced all the String and Symbol code. These were potential pitfalls that stood out to me when reviewing the PR and looking briefly at the CRuby source. My general point is that even if things come out correct, you could still alter the execution profile in such a way as to introduce a performance regression by changing from a fixed-width to a variable-width encoding or by not taking proper care of the code range value. None of that is insurmountable, of course.
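To make the fixed-width versus variable-width concern a bit more concrete, here is a rough Ruby-level sketch. The code range itself is internal to CRuby and not observable from Ruby, so this only illustrates why that metadata matters; the variable names are invented for the example.

```ruby
# ASCII-only content: one byte per character under either encoding tag.
ascii_tagged = "name".dup.force_encoding(Encoding::US_ASCII)
utf8_tagged  = "name".dup.force_encoding(Encoding::UTF_8)

# force_encoding only re-tags the string; the bytes are untouched.
same_bytes = ascii_tagged.bytes == utf8_tagged.bytes

# For ASCII-only content, character count equals byte count either way.
ascii_fixed = ascii_tagged.length == ascii_tagged.bytesize
utf8_fixed  = utf8_tagged.length == utf8_tagged.bytesize

# In general a UTF-8 string is variable-width, so character-based
# operations cannot assume length == bytesize unless the content is
# already known to be ASCII-only.
multibyte = "n\u00e9e" # U+00E9 takes two bytes in UTF-8
puts multibyte.length   # character count
puts multibyte.bytesize # byte count
```

When the content is known to be ASCII-only, character offsets map directly to byte offsets; without that knowledge a UTF-8 string has to be scanned first.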
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78955
* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242
It's not uncommon for symbols to have literal string counterparts, e.g.
```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```
Since the default source encoding is UTF-8, and symbols coerce their internal fstring to US-ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in US-ASCII, the other in UTF-8.
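The mismatch can be observed from Ruby itself. This is only a sketch: `literal` is explicitly tagged as UTF-8 to stand in for a string literal in a UTF-8 source file, and the encoding printed for `from_symbol` depends on whether the running Ruby includes a change like the one proposed here.

```ruby
# A UTF-8 tagged string standing in for the literal 'name' in UTF-8 source.
literal = "name".dup.force_encoding(Encoding::UTF_8)
from_symbol = :name.to_s

puts literal.encoding
puts from_symbol.encoding # US-ASCII on Rubies without the proposed patch

# The content is identical even when the encoding tags differ.
same_content = literal == from_symbol
same_bytes   = literal.bytes == from_symbol.bytes
puts same_content
puts same_bytes
```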
Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.
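A small sketch of why the encoding matters for sharing, using `String#-@` as the closest Ruby-level view of string interning. That `String#-@` is backed by the same table this issue calls the fstring registry is my understanding of CRuby rather than something stated in this thread, so treat it as an assumption.

```ruby
utf8_a = "name".dup.force_encoding(Encoding::UTF_8)
utf8_b = "name".dup.force_encoding(Encoding::UTF_8)
ascii  = "name".dup.force_encoding(Encoding::US_ASCII)

# Same bytes and same encoding: interning returns one shared object.
shared_same_encoding = (-utf8_a).equal?(-utf8_b)

# Same bytes but different encoding: two distinct interned objects,
# since a single String object carries exactly one encoding.
shared_across_encodings = (-utf8_a).equal?(-ascii)

puts shared_same_encoding
puts shared_across_encodings
```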
The only notable behavioral change is `Symbol#to_s`.
Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
However, several Ruby specs assert the current behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453
If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:
```ruby
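def to_s
  str = fstr.dup # fstr: the symbol's internal (UTF-8) fstring
  # Re-tag ASCII-only names as US-ASCII to keep the existing behavior.
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end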
```
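The pseudocode above can be tried outside the VM with a standalone helper. This is a sketch only: `ascii_coerced_copy` and its `interned` parameter are hypothetical names, with `interned` standing in for the symbol's internal fstring.

```ruby
# Hypothetical helper mirroring the proposed Symbol#to_s fallback.
# `interned` stands in for the symbol's internal (UTF-8) fstring.
def ascii_coerced_copy(interned)
  copy = interned.dup
  # Only ASCII-only content is re-tagged; anything else stays UTF-8.
  copy.force_encoding(Encoding::US_ASCII) if copy.ascii_only?
  copy
end

plain    = ascii_coerced_copy("name".dup.force_encoding(Encoding::UTF_8))
accented = ascii_coerced_copy("n\u00e9e")

puts plain.encoding
puts accented.encoding
```

Note that this goes through `force_encoding` on every call for ASCII-only names, which is the code path the reply above flags as worth checking for code range handling.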