[ruby-core:93251] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
From: jean.boussier@...
Date: 2019-06-19 15:55:05 UTC
List: ruby-core #93251
Issue #15940 has been updated by byroot (Jean Boussier).
To provide some data, I counted the duplicates in a Redmine heap dump (`ObjectSpace.dump_all`).
Here is the counting code:
```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true
require 'json'

# Collect every heap dump entry flagged as an fstring (one JSON object per line).
fstrings = []
STDIN.each do |line|
  object = JSON.parse(line)
  fstrings << object if object['fstring']
end

# Count how many fstring entries share the same value.
counts = Hash.new(0)
fstrings.each do |str|
  counts[str['value']] += 1
end

# A value is a duplicate when more than one fstring entry carries it.
duplicates = counts.select { |_value, count| count > 1 }.map(&:first)

puts "total fstrings: #{fstrings.size}"
puts "dups: #{duplicates.size}"
puts "sample:"
puts duplicates.first(20)
```
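For anyone who wants to reproduce this, a dump suitable for the script can be produced roughly as follows. This is only a minimal sketch: the `dump_heap` helper and the `heap.json` file name are illustrative choices of mine, and in a real app you would call it once the app is fully booted.

```ruby
require 'objspace'
require 'json'
require 'tmpdir'

# Hypothetical helper: writes the heap dump to `path`,
# one JSON object per line, and returns the path.
def dump_heap(path)
  GC.start # run a GC first to reduce noise from unreachable objects
  File.open(path, 'w') do |file|
    ObjectSpace.dump_all(output: file)
  end
  path
end

# The file name is an arbitrary choice for this example.
dump_path = dump_heap(File.join(Dir.tmpdir, 'heap.json'))
puts "heap dump written to #{dump_path}"
```

The resulting file can then be fed to the counting script on standard input.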
And the results for Redmine:
```
total fstrings: 84678
dups: 3686
sample:
changes
absent
part
EVENTS
RANGE
OBJECT
Silent
EXCEPTION
Settings
DATE
Index
Graph
COMPLEX
Definition
fcntl
inline
lockfile
update
gemfile
oth
```
That's about 4% of the fstring table (3686 duplicated values out of 84678 entries).
I also ran the script against one much bigger private app, and the duplicate ratio was similar, but the table was an order of magnitude bigger.
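One way to check that such duplicates really are US-ASCII / UTF-8 pairs is to group them by the `encoding` field of each dump entry. A rough sketch follows; `duplicate_encodings` is a hypothetical helper, and the sample lines are made up to resemble `ObjectSpace.dump_all` output, they are not taken from the Redmine dump.

```ruby
require 'json'

# Hypothetical helper: for every fstring value seen more than once,
# returns the sorted list of encodings it was seen with.
def duplicate_encodings(lines)
  encodings = Hash.new { |hash, value| hash[value] = [] }
  lines.each do |line|
    object = JSON.parse(line)
    # Skip non-fstrings and entries whose content was not dumped.
    next unless object['fstring'] && object.key?('value')
    encodings[object['value']] << object['encoding'].to_s
  end
  encodings.select { |_value, seen| seen.size > 1 }
           .transform_values(&:sort)
end

# Made-up sample lines shaped like heap dump entries.
sample_lines = [
  '{"type":"STRING","fstring":true,"value":"name","encoding":"UTF-8"}',
  '{"type":"STRING","fstring":true,"value":"name","encoding":"US-ASCII"}',
  '{"type":"STRING","fstring":true,"value":"email","encoding":"UTF-8"}',
  '{"type":"STRING","value":"name","encoding":"UTF-8"}'
]

duplicate_encodings(sample_lines).each do |value, seen|
  puts "#{value}: #{seen.join(', ')}"
end
# prints "name: US-ASCII, UTF-8"
```

For a real dump, something like `duplicate_encodings(File.foreach('heap.json'))` should work the same way.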
----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78701
* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242
It's not uncommon for symbols to have literal string counterparts, e.g.
```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```
Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet actually keeps two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.
Since UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.
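To illustrate why the two copies are interchangeable at the Ruby level, here is a small sketch. The variable names are mine, and it does not inspect the fstring registry itself (which is not exposed to Ruby code); it only shows that an ASCII-only string compares equal in both encodings.

```ruby
# Build the two variants explicitly so the result does not depend on
# the source encoding of this file.
ascii_name = 'name'.dup.force_encoding(Encoding::US_ASCII)
utf8_name  = 'name'.encode(Encoding::UTF_8)

puts ascii_name == utf8_name                   # → true (same content)
puts ascii_name.encoding == utf8_name.encoding # → false (different encodings)

# What this prints depends on the Ruby version and on whether the patch is applied.
puts :name.to_s.encoding
```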
The only notable behavioral change is `Symbol#to_s`.
Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
There are several Ruby specs asserting this behavior, though, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453
If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:
```ruby
def to_s
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
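The pseudocode above can be tried out as plain Ruby with a stand-in method. This is only a sketch of the idea, not the actual patch: `compatible_to_s` is a hypothetical name, and its `fstr` argument plays the role of the symbol's internal (frozen, UTF-8) fstring.

```ruby
# Hypothetical stand-in for the proposed Symbol#to_s fallback.
def compatible_to_s(fstr)
  str = fstr.dup # never mutate the shared, frozen fstring
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_only = 'name'.encode(Encoding::UTF_8).freeze
non_ascii  = "caf\u00e9".freeze # the \u escape makes this literal UTF-8

puts compatible_to_s(ascii_only).encoding # → US-ASCII
puts compatible_to_s(non_ascii).encoding  # → UTF-8
```

The cost of this approach is the extra `ascii_only?` check on every call, while the shared fstring itself stays in UTF-8.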