proposal: spec: allow combining characters in identifiers #20706

rsc · 2017-06-16T20:34:02Z

Forking from #16033, which had two related but different proposals in it. The proposal for this issue, by @robpike:

On a related note, some writing systems (Devanagari is one; see #5167) require combining characters. The current identifier rules forbid combining characters; perhaps that should be relaxed, although that will require a canonicalization rule for combining characters. Unicode does have a definition for identifiers (http://unicode.org/reports/tr31/); perhaps Go should use it. Note that the addition of combining characters, allied with the export proposal above, would make it possible to export Devanagari identifiers.

rsc · 2017-06-16T20:36:36Z

Re canonicalization, one possibility Rob and I discussed at one point was to require in the spec that implementations canonicalize during comparisons to establish whether two identifiers are the same but also to have gofmt canonicalize to generate its output (the former is required for the latter to be semantically safe). Then source code is consistent but the compilers will deal if not.

griesemer · 2017-06-16T20:42:45Z

@rsc Is this Go 2 or would you consider this for Go 1?

rsc · 2017-06-17T17:47:20Z

For Go 2.

rsc · 2017-06-17T17:48:40Z

Merging #5167 in here. From suraj@barkale.com in 2013:

My suggestion is to amend Go specification by allowing combining-mark & non-spacing-mark characters in identifiers.

This will be similar to Java identifier rules given at http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart(char).
A character may be part of a Java identifier if any of the following are true:

it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character

bakul · 2017-07-19T02:02:37Z

Note that the Java id rule link is invalid now.

The first char of an identifier is typically more constrained (e.g. no digit).

I propose that, as far as Indic languages are concerned, an identifier may not start with one of the various sign chars, digits, dependent vowels, or "virama". An identifier may start with a currency sign, an "om" sign, an independent vowel (vowel letter), or a consonant.

robpike · 2017-07-19T03:02:00Z

@bakul It's precisely that kind of minute precision we'd like to avoid. The rule must be very simple to express, as it is now. We are seeking a new but also simple-to-express rule that admits more classes of identifier without requiring natural-language-specific detail.

bakul · 2017-07-19T03:17:31Z

Would "१२३" (123 in devanagari) be considered a number or an identifier? If the latter, that would be strange!

If you don't want nitpicky rules, a simpler rule may be that the first char satisfies unicode.IsLetter() and the succeeding chars satisfy unicode.IsLetter() or IsDigit() or IsMark(). Plus whatever is needed for CJK.
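The simpler rule suggested here can be sketched directly with the standard `unicode` package. This is only an illustration of the comment's wording, not the Go spec or any proposed spec text: the function name `isIdent` is hypothetical, and the sketch deliberately ignores `_` and the "whatever is needed for CJK" part.

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdent is a hypothetical helper implementing the rule from the comment:
// the first rune must be a letter; every later rune must be a letter,
// a digit, or a mark (Unicode category M, which includes combining marks).
// Unlike the real Go spec, this sketch gives '_' no special treatment.
func isIdent(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 {
			if !unicode.IsLetter(r) {
				return false
			}
			continue
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) && !unicode.IsMark(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isIdent("नमस्ते"))  // true: letters plus combining marks
	fmt.Println(isIdent("१२३"))     // false: starts with a Devanagari digit
	fmt.Println(isIdent("x१"))      // true: a digit is fine after a letter
	fmt.Println(isIdent("\u093e")) // false: a dependent vowel sign is a mark
}
```

Under this rule "१२३" is rejected as an identifier because its first rune is a decimal digit, which addresses the question raised above without any script-specific detail.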

bakul · 2017-07-19T03:21:55Z

I should add: along with identifiers, numbers should also be expressible in other languages (but I expect that'll be even more unpopular!)

bakul · 2018-09-19T08:14:04Z

In case it makes a difference, the Swift programming language allows legitimate Indic words as identifiers. From Swift Lexical Structure:

Identifiers begin with an uppercase or lowercase letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed.

Most Indian-language words use combining chars, so it would be good to fix this.

If these identifiers are not exportable, that is fine. Prefixing with an uppercase letter from some other script is ugly but that can't be helped! I am tempted to suggest using the section symbol (§) or some such symbol as an additional exportable start char of an identifier.

bcmills · 2018-11-29T18:04:03Z

that will require a canonicalization rule for combining characters.

See #27896 for a proposal specifically about canonicalization.

That potentially affects existing programs, since (for example) μ and µ are both already allowed (and treated as distinct identifiers) in Go source code.

aarzilli · 2018-11-29T18:59:33Z

Re canonicalization, one possibility Rob and I discussed at one point was to require in the spec that implementations canonicalize during comparisons to establish whether two identifiers are the same but also to have gofmt canonicalize to generate its output (the former is required for the latter to be semantically safe). Then source code is consistent but the compilers will deal if not.

I would prefer that non-normalized identifiers were rejected, so I wouldn't have to worry whether my grep/editor/browser uses the same normalization strategy as gofmt/compiler when looking for identifiers, even in sources that weren't visited by gofmt.

mpvl · 2018-12-07T13:43:45Z

@bcmills: forcing NFKC is not backwards compatible. If we break backwards compatibility, I would prefer to simply disallow any character with a decomposition type "font" and permit all others as is.

bcmills · 2018-12-07T14:33:02Z

If we break backwards compatibility, I would prefer to simply disallow any character with a decomposition type "font" and permit all others as is.

I think that would lead to a pretty unfortunate user experience. Consider the snippet:

	var jalapeño = "🌶️"
	fmt.Println(jalapeño)

Without any sort of normalization, the otherwise-equivalent identifiers jalapeño and jalapeño refer to two completely different variables, and as far as I am aware none of the characters involved have decomposition type font.
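A small sketch of the difference being described, assuming the two spellings differ in that one uses the precomposed ñ (U+00F1) and the other uses n followed by COMBINING TILDE (U+0303). The names `composed`, `decomposed`, and `hasMark` are illustrative. Escape sequences are used so the distinction survives copy and paste.

```go
package main

import (
	"fmt"
	"unicode"
)

const (
	composed   = "jalape\u00f1o"  // ñ as a single code point
	decomposed = "jalapen\u0303o" // n followed by a combining tilde
)

// hasMark reports whether s contains any rune in Unicode category M.
func hasMark(s string) bool {
	for _, r := range s {
		if unicode.IsMark(r) {
			return true
		}
	}
	return false
}

func main() {
	// The strings render identically but compare as different.
	fmt.Println(composed == decomposed) // false

	// The decomposed form has one more code point.
	fmt.Println(len([]rune(composed)), len([]rune(decomposed))) // 8 9

	// Only the decomposed form contains a combining mark.
	fmt.Println(hasMark(composed), hasMark(decomposed)) // false true
}
```

Because the current identifier rules forbid combining characters, only the composed spelling is a valid identifier today; the ambiguity described above appears once marks are admitted without normalization.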

griesemer · 2019-02-05T22:21:21Z

In https://blog.golang.org/go2-here-we-come we propose to adopt a version of Unicode's TR-31 specification.

gocs · 2019-05-06T11:54:55Z

w̶e̸ ̵m̵i̴g̵h̷t̸ ̷h̴a̸v̷e̴ ̴a̶ ̴p̶r̸o̷b̸l̷e̴m̷

do we allow zalgo text and alike here?

bcmills · 2019-05-06T12:47:59Z

@gocs, you can already write intentionally-obfuscated code today (for example, mixing Latin and Cyrillic vowels). Can you give specific examples where you believe the current restriction has prevented someone from writing an unreasonable identifier they would have chosen otherwise?

eric-hawthorne · 2019-07-16T22:58:22Z

Arbitrary Unicode identifiers in a programming language are a bad idea. Because they permit arbitrary, nonsensical, or deliberately misleading mixtures of characters from different natural languages, and because the mapping from a character's glyph to its encoding is not one-to-one, they will just allow programs to become more incomprehensible and unmaintainable, considered as a whole corpus of programs.
It could also discourage international free/open source code sharing.

Identifiers should stay ASCII + digits, maybe with a handful of Greek letters thrown in as a concession to those wishing to express mathematics, engineering, and physics.

In this day and age, strings and comments in the language should be able to contain arbitrary Unicode characters, but the code should be treated as a standardized, constrained, and simple formal expression of math and logic, in a small, simple alphabet.

The program expression philosophy especially for free/open source should be more like how international aviation has standardized on English communications, for safe interoperability.

Sometimes less is more. Remember the ghost of PL/1. Worst programming language in the known universe because it tried to be everything to everyone.

I know Go from the get-go has permitted some Unicode in identifiers, so this is water under the bridge.
But don't increase the torrent of incomprehensible flotsam and jetsam now.

taralx · 2019-08-01T17:50:06Z

I disagree that gofmt should canonicalize. What I think it should do is unify - change all uses of an identifier to match the character sequence used at the declaration.

The only case that is a bit odd then is what to do about shadowing declarations - do we unify those too, or let them be unrelated?

bcmills · 2019-08-02T04:11:25Z

@taralx, the problem with unifying to the declaration site is that it is hostile to byte-oriented searches. If I declare a method AñadirJalapeños, and you declare a method AñadirJalapeños, then either they're the same method (and can both be used as the same interface) but with different representations in the source code, or they're different methods but due to encoding only.

Neither of those options seems particularly better than canonicalizing the source.

taralx · 2019-08-02T06:47:34Z

Okay, so interfaces make that more difficult. My concern is that many canonicalization schemes result in aggressive switching from one code block to another (e.g. reducing superscript numbers to regular numbers or full-width symbols to half-width symbols). If the plan is to use only mild canonicalization that does not do this, then this will not be an issue for me, at least.

bakul · 2020-10-28T21:39:42Z

What is the current thinking on this? On re-reading this thread, the first comment here by Russ captures exactly what I want (vis-a-vis Indic languages). What is the issue with simply following Unicode TR31? It talks about normalization (not canonicalization) in section 5. What specific issues does Go run into beyond what is discussed there? Since many languages do not have upper/lower case distinctions, you'd need something to indicate whether an ID is public or private. But beyond that I don't see anything special in Go. Note that Python, Swift, Nim & at least one implementation of Scheme are following it fine. I can always switch to one of these languages for some Sanskrit-related work I am doing, but it would be really nice if Go 2 supported such identifiers!

ianlancetaylor · 2020-10-28T22:23:00Z

As far as I know Unicode TR31 is still the current thinking. But it will take some work to get there.

smasher164 · 2022-10-16T18:45:28Z

If anyone wants to get a feel for what sorts of identifiers TR31 permits, I have a package implemented here: github.com/smasher164/xid.

As others have mentioned, NFKC isn't backwards-compatible with existing Go programs. However, NFKC is at least backwards-compatible with itself. This means that an identifier that's normalized in one version of Unicode will remain normalized in the future.

One option to accept a greater set of identifiers would be to take the union of XID_Start XID_Continue* and identifier = letter { letter | unicode_digit } (the current set), where letter = unicode_letter | '_'.

smasher164 · 2023-06-07T18:14:53Z

Like with the loop variable change, I feel like this sort of breaking change is something that will actually be a net positive. If there's any code out there with distinct identifiers that are equivalent under normalization that actually compiles, it's likely broken. Using the module version to opt into NFKC-normalized identifiers seems like a solid strategy to me.

gopherbot added this to the Proposal milestone Jun 16, 2017

gopherbot added the Proposal label Jun 16, 2017

robpike mentioned this issue Jun 16, 2017

proposal: spec: export uncased identifiers like 日本語 #16033

Closed

dsnet added the LanguageChange label Jun 16, 2017

rsc added the v2 label Jun 17, 2017

rsc mentioned this issue Jun 17, 2017

spec: combining mark & non spacing mark as invalid go identifiers make it pointless to use Devanagari identifiers #5167

Closed

rsc mentioned this issue Oct 2, 2017

proposal: spec: export uncased identifiers like 日本語 #5763

Closed

ianlancetaylor added the NeedsDecision label Feb 20, 2018

ALTree mentioned this issue Jul 25, 2019

cmd/compile: NFD-normalized unicode identifiers result in malformed error messages #33271

Closed

gopherbot removed the NeedsDecision label Aug 16, 2019

gopherbot added the NeedsDecision label Sep 3, 2019

ianlancetaylor added this to Incoming in Proposals (old) Oct 28, 2020

ianlancetaylor removed this from Incoming in Proposals (old) Oct 28, 2020

ianlancetaylor removed the NeedsDecision label Oct 28, 2020

ianlancetaylor mentioned this issue Nov 26, 2020

Unable to use some Hindi unicode characters in source code as identifiers #42830

Closed
