Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: unicode/utf16: add DecodeRuneBytes #65511

Closed
soypat opened this issue Feb 4, 2024 · 8 comments
Closed

proposal: unicode/utf16: add DecodeRuneBytes #65511

soypat opened this issue Feb 4, 2024 · 8 comments
Labels
Milestone

Comments

@soypat
Copy link

soypat commented Feb 4, 2024

Proposal Details

Hello peeps. I'd like to propose to extract part of the existing code in the encoding/utf16 package into a function.

In particular these lines: https://github.com/golang/go/blob/master/src/unicode/utf16/utf16.go#L116-L130

Why?

  • The main reason for my use case is to avoid allocations.
  • The second reason is to be able to easily decode from a byte slice. Today this operation is not easily done without understanding the internals of utf16.

On existing utf16.DecodeRune

utf16.DecodeRune already exists but its usage is tricky with existing API. One can't use it in the same way as the well designed utf8.DecodeRune since utf16.DecodeRune expects one to have already finished the step of deciding whether the first rune in the []uint16 is a single "self" or a surrogate pair- and there's no API that helps one make this decision!

It is my understanding that utf16.DecodeRune not having the following signature was a missed opportunity to solving the issues I'm currently having:

func DecodeRune(buf []uint16) (r rune, size16 int)

How?

Adding a package level function. Unverified and untested function logic included for visual purposes.

func DecodeRuneBytes(srcUTF16 []byte, order16 binary.ByteOrder) (r rune, size int) {
	// UTF16 values.
	const (
		// 0xd800-0xdc00 encodes the high 10 bits of a pair.
		// 0xdc00-0xe000 encodes the low 10 bits of a pair.
		// the value is those 20 bits plus 0x10000.
		surr1 = 0xd800
		surr2 = 0xdc00
		surr3 = 0xe000
	)
	var r1, r2 rune

	slen := len(srcUTF16)
	if slen == 0 {
		return '\uFFFD', 1
	}
	r1 = rune(order16.Uint16(srcUTF16))
	if slen >= 4 {
		r2 = rune(order16.Uint16(srcUTF16[2:]))
	}
	var ar rune
	switch {
	case r1 < surr1, surr3 <= r1:
		// normal rune
		ar = r1
		size = 2
	case surr1 <= r1 && r1 < surr2 && slen >= 4 &&
		surr2 <= r2 && r2 < surr3:
		// valid surrogate sequence
		ar = DecodeRune(r1, r2)
		size = 4
	default:
		// invalid surrogate sequence
		ar = '\uFFFD'
		size = 1
	}
	return ar, size
}

Note: Here's an example of the API implemented in a package and it's usage.

@soypat
Copy link
Author

soypat commented Feb 4, 2024

I've just noticed that I'd also greatly benefit from adding a EncodeRuneBytes. This way conversion routines between different formats would be greatly simplified.

func EncodeRuneBytes(dst []byte, r rune) int

@adonovan
Copy link
Member

adonovan commented Feb 5, 2024

The DecodeRuneBytes function adds a new concept to the utf16 package, namely the encoding of sequences of UTF-16 codes as byte strings, with the concomitant need to specify the byte order.

If we make the byte encoding the caller's concern, then can't the problem be solved with code something like this?

   a := next()
   if !utf16.IsSurrogate(a) {
       return a
   } 
   b := next()
   if !utf16.IsSurrogate(b) {
       return 0xFFFD
   }
   return utf16.DecodeRune(a, b)

(Replace 'next' with your favorite byte iterator.)

@soypat
Copy link
Author

soypat commented Feb 5, 2024

@adonovan Hmm, rune b should actually be checked with a different function. IsSurrogate does: surr1 <= r && r < surr3 while rune b is invalid if it does not fulfill surr2 <= r && r < surr3. But you are close. (close as in that could work if there was a IsSecondSurrogate)

@adonovan
Copy link
Member

adonovan commented Feb 6, 2024

@adonovan Hmm, rune b should actually be checked with a different function. IsSurrogate does: surr1 <= r && r < surr3 while rune b is invalid if it does not fulfill surr2 <= r && r < surr3. But you are close. (close as in that could work if there was a IsSecondSurrogate)

DecodeRune applies the correct range check to each surrogate, so my example will return the correct rune; however, when it returns U+FFFD it may consume too much: if 'a' was invalid it should not consume 'b'. To fix that we would need not only IsHighSurrogate (that's the usual term for "second") for 'b', but also IsLowSurrogate for 'a'.

@seankhliao seankhliao changed the title proposal: unicode/utf16: Add DecodeRuneBytes proposal: unicode/utf16: add DecodeRuneBytes Feb 6, 2024
@ianlancetaylor
Copy link
Contributor

CC @dsnet

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

The utf16 package is meant to be as minimal as possible. If you really need to decode bytes, it is easy to decode them outside the loop and then call DecodeRune with successive pairs of uint16 values.

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

Also we probably do not want utf16 to import encoding/binary and then depend recursively on many other packages, including reflect. Right now utf16 has no imports at all, and it is important to keep it that way for use on Windows in package syscall.

@rsc
Copy link
Contributor

rsc commented Mar 15, 2024

This proposal has been declined as infeasible.
— rsc for the proposal review group

@rsc rsc closed this as completed Mar 15, 2024
@rsc rsc changed the title proposal: unicode/utf16: add DecodeRuneBytes proposal: unicode/utf16: add DecodeRuneBytes Mar 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Declined
Development

No branches or pull requests

5 participants