[ruby-core:73689] [Ruby trunk Bug#4044] Regex matching errors when using \W character class and /i option
From:
matthew@...
Date:
2016-02-03 22:44:23 UTC
List:
ruby-core #73689
Issue #4044 has been updated by Matthew Kerwin.
Martin Dürst wrote:
> On 2016/02/03 12:21, matthew@kerwin.net.au wrote:
>
> > I want to write a spec for this, but some of the details are unclear to me. Can we confirm whether each of the following are spec?
>
> Please don't just assume that the current behavior is spec.
Indeed, that's why I asked.
> If it
> doesn't match with common sense in any way, it's very clear that we have
> to fix it. There may be borderline cases that are up for discussion, but
> at least most of the examples I have seen don't meet that criterion.
>
Confusion abounds. I thought that if there was a formal spec, at least that would give a solid grounding to start from. As it is, we rely on implementations to describe what should or does happen, which is imperfect and lets us confuse bugs with spec.
(Right now I'm particularly interested in why `/[\W]/i =~ 'k' #=> nil`)
> My understanding was that Ken Takata fixed the problem with r47598, but
> I'll try to have another look at that.
>
> When I looked at Ken's solution last time
> (the details are at the following link, in Japanese
> https://github.com/k-takata/Onigmo/issues/4), it included some aspects
> related to ASCII, which keeps confusing me.
>
I've looked at that issue, but I'm afraid I can't read Japanese (and Google translate only gets me so far.) I think I get the gist of it, but any subtlety is probably lost to me.
> The relevant specification is Unicode Technical Standard #18, Unicode
> Regular Expressions, in particular
> http://www.unicode.org/reports/tr18/#Simple_Loose_Matches. There are
> various choices at the end of that section that are relevant to this issue.
>
> My personal preference among the choices A-D is B. As far as I
> understand it, it would mean that while a /i option would change how
> literal characters are matched, it would not affect how properties
> such as \W are matched.
>
I suppose we're in choice D at the moment (that would explain why `/\W/i` and `/[\W]/i` match differently,) but just which "specific properties and/or explicit character classes" remains unclear. Documenting those (and writing a spec) would help.
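As a starting point for documenting this, here is a minimal sketch, assuming nothing beyond core Ruby. The helper name `case_class_report` is made up for illustration. It lists which lowercase ASCII letters each pattern matches, so the bracketed and unbracketed forms can be compared on whatever Ruby is at hand. I give no expected output, since the result depends on the Ruby/Onigmo version.

```ruby
# Hypothetical helper: for each labelled pattern, collect the letters
# 'a'..'z' that the pattern matches.
def case_class_report(patterns)
  letters = ('a'..'z').to_a
  patterns.each_with_object({}) do |(label, re), report|
    # String#=~ returns an index (truthy, even 0) or nil.
    report[label] = letters.select { |c| c =~ re }
  end
end

patterns = {
  '/\W/i'    => /\W/i,
  '/[\W]/i'  => /[\W]/i,
  '/[^\W]/i' => /[^\W]/i,
  '/[^\W]/'  => /[^\W]/ # control without /i: every letter is a word character
}

case_class_report(patterns).each do |label, matched|
  puts format('%-10s %s', label, matched.join)
end
```

Any letter that shows up under `/[\W]/i` but not under `/\W/i` (or is missing under `/[^\W]/i`) is a candidate for the "specific properties and/or explicit character classes" list.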
> My justification for this is as follows: If I want e.g. a word
> character, then that already should include all the necessary
> characters, both upper and lower case (and title case just in case you
> forgot about it :-). It's difficult to see why I'd want the set of
> characters to change when adding /i. The same argument can be applied to
> \W and most if not all similar cases.
>
When we were discussing it on Ruby Talk the other day I came up with this:
* the 'ﬀ' ligature is a non-word character
* it has a case conversion, so is affected by the `//i` flag
So:
* `/ﬀ/` is a subset of `/\W/`
* `/ﬀ/i` matches 'ﬀ', 'FF', 'ff', 'fF', and 'Ff'
* therefore `/\W/i` should match all of the above
The first two dot points are where I see the contention. If I were to make a general rule, I'd say that "\W" should not be expanded for case-folding, since 'case' is a property of word characters. (If anything matches "\W" it is, by definition, not a word character, so should not be subject to word-type operations like case-folding.)
If that were so, `/ﬀ/i` (and therefore `/\W/i`) would match 'ﬀ' but not 'FF'.
That would, I think, make `\W` a perfect complement to `\w` (identical to `[^\w]`); which seems to be what people expect.
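The "perfect complement" expectation can be checked mechanically. This is only a sketch under my own assumptions: `complement_violations` is a hypothetical helper, and the sample set (printable ASCII plus the U+FB00 ligature) is just what I picked for illustration. It reports every character for which `\w` and `\W` both match or both fail.

```ruby
# Hypothetical helper: return the characters that break the complement
# rule, i.e. where the two patterns both match or both fail.
def complement_violations(chars, word_re, non_word_re)
  chars.reject do |c|
    word_hit     = !(c =~ word_re).nil?
    non_word_hit = !(c =~ non_word_re).nil?
    word_hit ^ non_word_hit # exactly one of the two should match
  end
end

# Printable ASCII plus the U+FB00 (ff) ligature, all as UTF-8 strings.
samples = (0x20..0x7e).map { |n| n.chr(Encoding::UTF_8) } + ["\uFB00"]

puts "without /i: #{complement_violations(samples, /\w/, /\W/).inspect}"
puts "with /i:    #{complement_violations(samples, /\w/i, /\W/i).inspect}"
puts "bracketed:  #{complement_violations(samples, /[\w]/i, /[\W]/i).inspect}"
```

Without `/i` the list should be empty. Whatever appears in the other two lines is where the implementation departs from the rule proposed above.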
I think that means you and I are saying the same thing, in different ways.
> The case that I think can be up for discussion is explicit character
> classes, such as [a-z]. Here, in effect automatically adding A-Z (and
> some other case equivalents) may indeed make sense.
Certainly; I use `/[0-9a-f]/i` myself for matching hexadecimal numbers (and similar patterns for similar things.) However where would that leave us with `/[a-e\W]/i` ?
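For concreteness, a small sketch of the hexadecimal case, plus a probe for the mixed class. The method name `hex_string?` is my own, not anything from the standard library. The hex part behaves the same on any Ruby I know of; for `/[a-e\W]/i` I deliberately give no expected output, since that is exactly the open question.

```ruby
# Hypothetical helper: true when str consists only of hex digits,
# in either case. \A and \z anchor the whole string.
def hex_string?(str)
  !(str =~ /\A[0-9a-f]+\z/i).nil?
end

puts hex_string?('DEADbeef') # => true
puts hex_string?('0x1G')     # => false

# Probe: which ASCII letters does the mixed class match under /i?
letters = ('a'..'z').to_a + ('A'..'Z').to_a
puts letters.select { |c| c =~ /[a-e\W]/i }.join
```

If explicit ranges are case-folded but `\W` is not, the probe should print only a-e and A-E.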
----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i option
https://bugs.ruby-lang.org/issues/4044#change-56886
* Author: Ben Hoskings
* Status: Closed
* Priority: Normal
* Assignee: Yui NARUSE
* ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
* Backport:
----------------------------------------
=begin
Hi all,
Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)
The case-insensitive (/i) option together with the non-word character class (\W) match inconsistently against the alphabet. Specifically the regex doesn't match properly against the letters 'k' and 's'.
The following expression demonstrates the problem in irb:
puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }
As a reference, the following two expressions are working properly:
puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }
Cheers
Ben Hoskings & Josh Bassett
=end
--
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>