[ruby-core:70807] [Ruby trunk - Bug #11522] URI::decode returns incorrectly encoding strings

From: duerst@...
Date: 2015-09-15 02:40:36 UTC
List: ruby-core #70807
Issue #11522 has been updated by Martin Dürst.


Nobuyoshi Nakada wrote:
> It has no hints for encoding.

In theory, that's correct. In practice, there are several better possibilities.

1) We can add an additional parameter that indicates the encoding.

2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.

3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything else than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape. But 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).

----------------------------------------
Bug #11522: URI::decode returns incorrectly encoding strings
https://bugs.ruby-lang.org/issues/11522#change-54190

* Author: Charlie Anderson
* Status: Rejected
* Priority: Normal
* Assignee: akira yamada
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
When given unicode characters to encode and decode, the URI module returns a string with an invalid encoding.

~~~
irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false
~~~

I would expect decoded to have a valid encoding - probably as UTF-8?



-- 
https://bugs.ruby-lang.org/

In This Thread

Prev Next