proposal: unicode/utf16: add DecodeRuneBytes #65511

soypat · 2024-02-04T14:10:23Z

Proposal Details

Hello peeps. I'd like to propose extracting part of the existing code in the unicode/utf16 package into a function.

In particular these lines: https://github.com/golang/go/blob/master/src/unicode/utf16/utf16.go#L116-L130

Why?

The main reason for my use case is to avoid allocations.
The second reason is to be able to easily decode from a byte slice. Today this operation is not easily done without understanding the internals of utf16.

On existing `utf16.DecodeRune`

utf16.DecodeRune already exists, but it is tricky to use with the existing API. One can't use it the same way as the well-designed utf8.DecodeRune, since utf16.DecodeRune expects the caller to have already decided whether the first code unit in the []uint16 is a single "self" value or the start of a surrogate pair, and there's no API that helps one make this decision!

It is my understanding that utf16.DecodeRune not having the following signature was a missed opportunity to solve the issues I'm currently having:

func DecodeRune(buf []uint16) (r rune, size16 int)
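
For illustration, here is a minimal sketch of what a function with that signature could look like when built on today's exported API. The name decodeRune16 and its error convention (size 0 for empty input, size 1 for an unpaired surrogate, loosely mirroring utf8.DecodeRune) are assumptions made for this example and are not part of unicode/utf16.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeRune16 is a hypothetical helper, not part of unicode/utf16.
// It returns the first rune in buf and the number of uint16 code units
// consumed. Empty input yields (U+FFFD, 0); an unpaired surrogate yields
// (U+FFFD, 1). Both conventions are assumptions of this sketch.
func decodeRune16(buf []uint16) (r rune, size16 int) {
	if len(buf) == 0 {
		return '\uFFFD', 0
	}
	r1 := rune(buf[0])
	if !utf16.IsSurrogate(r1) {
		return r1, 1 // ordinary code unit, encodes itself
	}
	if len(buf) >= 2 {
		// utf16.DecodeRune returns U+FFFD unless (r1, r2) is a valid pair.
		if r := utf16.DecodeRune(r1, rune(buf[1])); r != '\uFFFD' {
			return r, 2
		}
	}
	return '\uFFFD', 1
}

func main() {
	buf := utf16.Encode([]rune("a😀"))
	for len(buf) > 0 {
		r, n := decodeRune16(buf)
		fmt.Printf("%U consumed %d unit(s)\n", r, n)
		buf = buf[n:]
	}
}
```

The loop in main advances by the returned size, which is the usage pattern utf8.DecodeRune allows for byte slices.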

How?

Adding a package-level function. The function logic below is unverified and untested, and is included for illustration only.

func DecodeRuneBytes(srcUTF16 []byte, order16 binary.ByteOrder) (r rune, size int) {
	// UTF16 values.
	const (
		// 0xd800-0xdc00 encodes the high 10 bits of a pair.
		// 0xdc00-0xe000 encodes the low 10 bits of a pair.
		// the value is those 20 bits plus 0x10000.
		surr1 = 0xd800
		surr2 = 0xdc00
		surr3 = 0xe000
	)
	var r1, r2 rune

	slen := len(srcUTF16)
	if slen < 2 {
		// Too short to hold a single UTF-16 code unit; reading it would panic.
		return '\uFFFD', 1
	}
	r1 = rune(order16.Uint16(srcUTF16))
	if slen >= 4 {
		r2 = rune(order16.Uint16(srcUTF16[2:]))
	}
	var ar rune
	switch {
	case r1 < surr1, surr3 <= r1:
		// normal rune
		ar = r1
		size = 2
	case surr1 <= r1 && r1 < surr2 && slen >= 4 &&
		surr2 <= r2 && r2 < surr3:
		// valid surrogate sequence
		ar = DecodeRune(r1, r2)
		size = 4
	default:
		// invalid surrogate sequence
		ar = '\uFFFD'
		size = 1
	}
	return ar, size
}

Note: Here's an example of the API implemented in a package and its usage.


soypat · 2024-02-04T14:28:31Z

I've just noticed that I'd also greatly benefit from adding an EncodeRuneBytes. This way, conversion routines between different formats would be greatly simplified.

func EncodeRuneBytes(dst []byte, r rune) int

adonovan · 2024-02-05T22:21:14Z

The DecodeRuneBytes function adds a new concept to the utf16 package, namely the encoding of sequences of UTF-16 codes as byte strings, with the concomitant need to specify the byte order.

If we make the byte encoding the caller's concern, then can't the problem be solved with code something like this?

   a := next()
   if !utf16.IsSurrogate(a) {
       return a
   } 
   b := next()
   if !utf16.IsSurrogate(b) {
       return 0xFFFD
   }
   return utf16.DecodeRune(a, b)

(Replace 'next' with your favorite byte iterator.)

soypat · 2024-02-05T23:31:27Z

@adonovan Hmm, rune b should actually be checked with a different function. IsSurrogate checks surr1 <= r && r < surr3, while rune b is invalid if it does not fulfill surr2 <= r && r < surr3. But you are close (close as in that could work if there was an IsSecondSurrogate).

adonovan · 2024-02-06T02:49:24Z

> @adonovan Hmm, rune b should actually be checked with a different function. IsSurrogate checks surr1 <= r && r < surr3, while rune b is invalid if it does not fulfill surr2 <= r && r < surr3. But you are close (close as in that could work if there was an IsSecondSurrogate).

DecodeRune applies the correct range check to each surrogate, so my example will return the correct rune; however, when it returns U+FFFD it may consume too much: if 'a' was invalid it should not consume 'b'. To fix that we would need not only IsLowSurrogate (that's the usual term for "second") for 'b', but also IsHighSurrogate for 'a'.
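
A minimal sketch of that fix, assuming the input is already a []uint16 and using an index in place of 'next'. In Unicode terminology the high surrogate (0xD800 to 0xDBFF) comes first in a pair and the low surrogate (0xDC00 to 0xDFFF) comes second. The helpers isHighSurrogate and isLowSurrogate are hypothetical and written out locally, since unicode/utf16 exports only IsSurrogate.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// isHighSurrogate and isLowSurrogate are hypothetical helpers defined for
// this sketch; they are not exported by unicode/utf16.
func isHighSurrogate(u uint16) bool { return 0xD800 <= u && u < 0xDC00 }
func isLowSurrogate(u uint16) bool  { return 0xDC00 <= u && u < 0xE000 }

// decodeAll decodes units into runes. An unpaired surrogate produces
// U+FFFD and consumes exactly one code unit, so the unit after it is
// still decoded on its own.
func decodeAll(units []uint16) []rune {
	var out []rune
	for i := 0; i < len(units); i++ {
		a := units[i]
		switch {
		case !isHighSurrogate(a) && !isLowSurrogate(a):
			out = append(out, rune(a))
		case isHighSurrogate(a) && i+1 < len(units) && isLowSurrogate(units[i+1]):
			out = append(out, utf16.DecodeRune(rune(a), rune(units[i+1])))
			i++ // the pair consumed the low surrogate as well
		default:
			out = append(out, '\uFFFD') // consume only a
		}
	}
	return out
}

func main() {
	units := []uint16{0xDE00, 0x0061, 0xD83D, 0xDE00}
	fmt.Printf("%U\n", decodeAll(units))
}
```

For whole slices, the existing utf16.Decode also replaces invalid surrogate sequences with U+FFFD, so a loop like this is mainly of interest when decoding incrementally.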

ianlancetaylor · 2024-02-07T00:08:33Z

CC @dsnet

rsc · 2024-03-13T17:52:34Z

The utf16 package is meant to be as minimal as possible. If you really need to decode bytes, it is easy to decode them outside the loop and then call DecodeRune with successive pairs of uint16 values.
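
One way to read this suggestion, sketched under assumptions: the caller assembles uint16 values from the bytes itself and hands pairs to utf16.DecodeRune. The function forEachRuneLE is hypothetical, assumes little-endian input, ignores a trailing odd byte, and reports each unpaired surrogate as U+FFFD while consuming only its own two bytes.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// forEachRuneLE is a hypothetical helper: it walks little-endian UTF-16
// bytes and calls fn for every decoded rune, without building an
// intermediate []uint16. A trailing odd byte is ignored.
func forEachRuneLE(src []byte, fn func(rune)) {
	for i := 0; i+1 < len(src); i += 2 {
		r := rune(src[i]) | rune(src[i+1])<<8
		if utf16.IsSurrogate(r) {
			var r2 rune // stays 0 (not a surrogate) if no second unit exists
			if i+3 < len(src) {
				r2 = rune(src[i+2]) | rune(src[i+3])<<8
			}
			// DecodeRune returns U+FFFD unless (r, r2) is a valid pair.
			r = utf16.DecodeRune(r, r2)
			if r != '\uFFFD' {
				i += 2 // a valid pair used two extra bytes
			}
		}
		fn(r)
	}
}

func main() {
	src := []byte{0x61, 0x00, 0x3D, 0xD8, 0x00, 0xDE} // "a😀" in UTF-16LE
	forEachRuneLE(src, func(r rune) { fmt.Printf("%U\n", r) })
}
```

The byte order is handled with plain shifts here, so the sketch needs no import beyond unicode/utf16.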

rsc · 2024-03-13T17:53:24Z

Also we probably do not want utf16 to import encoding/binary and then depend recursively on many other packages, including reflect. Right now utf16 has no imports at all, and it is important to keep it that way for use on Windows in package syscall.

rsc · 2024-03-15T01:13:24Z

This proposal has been declined as infeasible.
— rsc for the proposal review group

soypat added the Proposal label Feb 4, 2024

gopherbot added this to the Proposal milestone Feb 4, 2024

soypat mentioned this issue Feb 4, 2024

unicode/utf16: Add example on how to use utf16.DecodeRune #65498

Open

seankhliao changed the title ~~proposal: unicode/utf16: Add DecodeRuneBytes~~ proposal: unicode/utf16: add DecodeRuneBytes Feb 6, 2024

rsc closed this as completed Mar 15, 2024

rsc changed the title ~~proposal: unicode/utf16: add DecodeRuneBytes~~ proposal: unicode/utf16: add DecodeRuneBytes Mar 15, 2024
