[ruby-core:77535] Re: [Ruby trunk Feature#12650] Use UTF-8 encoding for ENV on Windows

From: RRRoy BBBean <rrroybbbean@...>
Date: 2016-10-10 02:44:51 UTC
List: ruby-core #77535
PIPES: I wrote a small gem several years ago that handled a problem with 
UTF-8 I/O. The key parts, extracted from their containing module & 
class, are below. This is how I dealt with Hangeul (Korean) characters 
used as data for a non-web application.

     @stdio_encoding = 'UTF-8'

     def run
       validate_and_configure
       @stdin, @stdout, @stderr, @wait_thread =
         Open3.popen3( @cmd_text, :chdir=>@cd_path )
       @stdin.set_encoding @stdio_encoding
       @stdout.set_encoding @stdio_encoding
       @stderr.set_encoding @stdio_encoding
       @running = monitor_stdout && monitor_stderr && attend_thread
     end
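
For anyone who wants to try the pipe technique outside the gem, here is a minimal self-contained sketch. The `run_utf8` helper and the echoing child command are hypothetical names made up for illustration; only `Open3.popen3` and `IO#set_encoding` come from the code above.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper (not part of the gem): run a command, feed it
# optional input, and read its output as UTF-8.
def run_utf8(*cmd, input: nil)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thread|
    # Tag all three pipes as UTF-8 so strings read from them are not
    # labelled with the locale's default external encoding.
    [stdin, stdout, stderr].each { |io| io.set_encoding('UTF-8') }

    # Drain stderr in a thread so a chatty child cannot block on a full pipe.
    err_reader = Thread.new { stderr.read }

    stdin.write(input) if input
    stdin.close
    out = stdout.read
    [out, err_reader.value, wait_thread.value]
  end
end

# Assumption for the demo: the child is another Ruby process that copies
# stdin to stdout in binary mode, so the bytes pass through untouched.
child = [RbConfig.ruby, '-e',
         '$stdin.binmode; $stdout.binmode; $stdout.write($stdin.read)']
out, _err, status = run_utf8(*child, input: "한글\n")
puts "#{out.encoding} valid=#{out.valid_encoding?} ok=#{status.success?}"
```

Without the `set_encoding` calls, the strings read from the pipes would carry whatever `Encoding.default_external` happens to be, which on Windows is usually a codepage rather than UTF-8.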

DIRECTORY LISTINGS: From some other code, I use this trick to read 
filenames in Hangeul.

     Dir.entries(@titles_path, :encoding=>'UTF-8').each {|thing_in_directory| ... }
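
A small runnable sketch of the same trick, assuming a temporary directory and a made-up Hangeul filename (neither is from my original code):

```ruby
require 'tmpdir'

# Hypothetical demo: create one file with a Hangeul name, then list the
# directory while asking Ruby to tag the returned names as UTF-8.
listed = Dir.mktmpdir do |dir|
  File.write(File.join(dir, '한글.txt'), '')
  Dir.entries(dir, encoding: 'UTF-8') - %w[. ..]
end

listed.each { |name| puts "#{name} (#{name.encoding})" }
```

The `encoding:` option only controls how the names are labelled (and, on Windows, converted); it does not rename anything on disk.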

FILE I/O with BOM: For file I/O with Hangeul, I use crazy stuff like this.

BOM = "\xEF\xBB\xBF".force_encoding("UTF-8")

Note that some applications (Firefox, Notepad++) recognize the Byte 
Order Mark, and other applications are befuddled when they encounter it. 
I, personally, prefer to use the Byte Order Mark because it immediately 
identifies the file format as UTF-8 (for applications that recognize the 
BOM).

         def strip_bom line
             return nil if line.nil? || line.empty?
             line.force_encoding 'UTF-8'
             line.gsub( BOM, '' )
         end

Also note that when files containing the BOM are concatenated or pasted 
into one another by BOM-befuddled applications, one or more Byte Order 
Marks can easily become embedded within the data. That's why I use the 
above method.
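
As an alternative sketch (not what my gem does): Ruby's `BOM|UTF-8` read mode drops a BOM at the very start of a file, but embedded ones still have to be removed by hand. The helper name `read_without_boms`, the constant `BOM_CHAR`, and the sample text are made up for illustration.

```ruby
require 'tempfile'

BOM_CHAR = "\uFEFF" # the same three bytes as "\xEF\xBB\xBF" in UTF-8

# Hypothetical helper: read a UTF-8 file and drop every BOM,
# whether it leads the file or is embedded in the data.
def read_without_boms(path)
  # 'r:BOM|UTF-8' consumes a BOM at the start of the file, if present.
  text = File.read(path, mode: 'r:BOM|UTF-8')
  # BOMs left in the middle by concatenation are deleted explicitly.
  text.delete(BOM_CHAR)
end

# Demo: a file with one leading BOM and one embedded BOM.
sample = Tempfile.create('bom-demo') do |f|
  f.binmode
  f.write("\uFEFF안녕\n\uFEFF하세요\n")
  f.flush
  read_without_boms(f.path)
end

puts sample
```

Unlike `strip_bom` above, this returns a new string and never calls `force_encoding` on the caller's data, so it also works with frozen strings.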

Anyway, I learned to cope with some of the UTF-8 issues in Ruby because 
of my work with Korean. I like the way Ruby handles UTF-8 now, although 
it would be nice if everyone could adopt UTF-8 as the de facto standard.

I'm not claiming that my coding techniques are any good, but maybe this 
will help someone.



On 10/07/2016 08:25 PM, ethan_j_brown@hotmail.com wrote:
> Issue #12650 has been updated by Ethan Brown.
>
>
> If you could rethink the plan to wait until Ruby 3, that would be great.
>
> I would expect Ruby to normalize on UTF-8 strings everywhere internally, and only convert to local codepage on the boundary (such as writing to console, file, etc).
>
> We are tracking a number of issues in Puppet that we believe are caused by the current behavior:
>
> * [Puppet Throws Exception when Running Under Unicode Windows User](https://tickets.puppetlabs.com/browse/PUP-6035)
> * [Bundler Fails when Running Under a Unicode Windows User](https://tickets.puppetlabs.com/browse/PUP-6034)
> * [Puppet Crashes when Unicode User Applies Manifest](https://tickets.puppetlabs.com/browse/PUP-5822)
>
> ----------------------------------------
> Feature #12650: Use UTF-8 encoding for ENV on Windows
> https://bugs.ruby-lang.org/issues/12650#change-60787
>
> * Author: Dāvis Mosāns
> * Status: Open
> * Priority: Normal
> * Assignee:
> ----------------------------------------
> Windows environment variables support Unicode (same wide WinAPI) and so there's no reason to limit ourselves to any codepage.
> Currently ENV would use locale's encoding (console's codepage) which obviously won't work correctly for characters outside of those codepages.
>
> I've attached a patch which implements this and fixes bug #9715
>
>
> ---Files--------------------------------
> 0001-Always-use-UTF-8-encoded-environment-on-Windows.patch (3.64 KB)
>
>


