encoding/xml: add decode from TokenReader, to enable stream transformers #19480

Closed

SamWhited opened this issue Mar 9, 2017 · 18 comments

SamWhited commented Mar 9, 2017

This is a tracking issue for this design document.

Abstract

The encoding/xml package contains an API for tokenizing an XML stream, but no
API exists for processing or manipulating the resulting token stream.
This proposal describes such an API.

Proposed API

// A Tokenizer is anything that can decode a stream of XML tokens, including an
// xml.Decoder.
type Tokenizer interface {
	Token() (xml.Token, error)
	Skip() error
}

// A Transformer is a function that takes a Tokenizer and returns a new
// Tokenizer which outputs a transformed token stream.
type Transformer func(src Tokenizer) Tokenizer

// Inspect performs an operation for each token in the stream without
// transforming the stream in any way.
// It is often injected into the middle of a transformer pipeline for debugging.
func Inspect(f func(t xml.Token)) Transformer {}

// Map transforms the tokens in the input using the given mapping function.
func Map(mapping func(t xml.Token) xml.Token) Transformer {}

// Remove returns a Transformer that removes tokens for which f returns true.
func Remove(f func(t xml.Token) bool) Transformer {}

// RemoveElement returns a Transformer that removes entire elements (and their
// children) if f returns true for the element's start token.
func RemoveElement(f func(start xml.StartElement) bool) Transformer {}
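
For concreteness, here is one way Map could plausibly be implemented in terms of this interface (a sketch only; the mapper type is illustrative and not part of the proposal):

// mapper wraps a Tokenizer and applies a mapping function to each token.
type mapper struct {
	src     Tokenizer
	mapping func(t xml.Token) xml.Token
}

// Token reads the next token from the source and returns its mapped form.
func (m mapper) Token() (xml.Token, error) {
	tok, err := m.src.Token()
	if err != nil {
		return nil, err
	}
	return m.mapping(tok), nil
}

// Skip is proxied to the underlying Tokenizer unchanged.
func (m mapper) Skip() error { return m.src.Skip() }

func Map(mapping func(t xml.Token) xml.Token) Transformer {
	return func(src Tokenizer) Tokenizer {
		return mapper{src: src, mapping: mapping}
	}
}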

Please see the formal proposal for a list of open questions and justification.

Example implementation: https://godoc.org/mellium.im/xmlstream
Design doc: https://golang.org/design/19480-xml-stream

SamWhited commented Mar 10, 2017

Other considerations:

If this sort of API is adopted, people will want to write and use consumers. Since consumers would be guided by the Tokenizer API, what would they look like (and can we do better)? Right now I think they'd have to be functions like the following:

// Encode consumes a tokenizer and reencodes its tokens.
// If an error is returned on the Decode or Encode side, it is returned
// immediately.
// Since Encode is defined as consuming the stream until the end, io.EOF is not
// returned.
// If no error would be returned, Encode flushes the underlying encoder when it
// is done.
func Encode(e *xml.Encoder, t Tokenizer) error {}
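
One plausible implementation of that contract (a sketch against the Tokenizer interface above; the io.EOF handling follows the doc comment):

func Encode(e *xml.Encoder, t Tokenizer) error {
	for {
		tok, err := t.Token()
		if err == io.EOF {
			// The stream is exhausted; flush rather than returning io.EOF.
			return e.Flush()
		}
		if err != nil {
			return err
		}
		if err := e.EncodeToken(tok); err != nil {
			return err
		}
	}
}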

example usage:

func ExampleEncode() {
	removequote := xmlstream.Remove(func(t xml.Token) bool {
		switch tok := t.(type) {
		case xml.StartElement:
			return tok.Name.Local == "quote"
		case xml.EndElement:
			return tok.Name.Local == "quote"
		}
		return false
	})

	e := xml.NewEncoder(os.Stdout)
	xmlstream.Encode(e, removequote(xml.NewDecoder(strings.NewReader(`
<quote>
  <p>Foolery, sir, does walk about the orb, like the sun; it shines everywhere.</p>
</quote>`))))
	// Output:
	// <p>Foolery, sir, does walk about the orb, like the sun; it shines everywhere.</p>
}

I vaguely feel that this API could be improved upon, but I'll have to think about it.

SamWhited commented Mar 26, 2017

More thoughts:

I reread this proposal today and I don't think I made this clear, but one reason it might be good to have this in the encoding/xml package itself (or in a new alternate XML package in the subrepos that could break compatibility with, and fix some of the bugs in, the standard library version) is that it opens up the possibility of something like the following, both for backwards compatibility with existing APIs that expect a *xml.Decoder and for the convenience of using Marshal/Unmarshal to go to/from structs:

// NewTokenDecoder returns a new Decoder that unmarshals based on the token
// stream returned by t.
func NewTokenDecoder(d *Decoder, t Tokenizer) *Decoder {}

Which means that something like the following would work:

// Unmarshals "foo" into an "bar" struct
fooMap := xml.Map(func(t xml.Token) xml.Token {
	switch tok := t.(type) {
	case xml.StartElement:
		if tok.Name.Local == "foo" {
			tok.Name.Local = "bar"
			return tok
		}
	case xml.EndElement:
		if tok.Name.Local == "foo" {
			tok.Name.Local = "bar"
			return tok
		}
	}
	return t
})

const somexml = "<foo>Test</foo>"
d := xml.NewDecoder(strings.NewReader(somexml))
d = xml.NewTokenDecoder(d, fooMap(d))

s := struct {
	XMLName xml.Name `xml:"bar"`
	Text    string   `xml:",cdata"`
}{}
d.Decode(&s)

// s.Text == "Test"

Wrapping a Tokenizer in a concrete Decoder struct is trivial to implement, but if we ever wanted to do this, it's not exactly clear to me what RawToken would become. Would it bypass the transformation entirely and continue to return exactly what the underlying tokenizer would have returned, or would it still skip namespace substitution but apply the transformation (and if so, how does the transformation know what to do with it; do we have to add RawToken to the tokenizer API too)?

rsc commented Mar 27, 2017

Can this be done outside the package? If not, what is fundamentally required in encoding/xml to make this possible to develop outside?

SamWhited commented Mar 27, 2017

It depends on whether we want some sort of concrete-Decoder-that-wraps-a-tokenizer compatibility API like the one I mentioned in my previous comment (which I think would be very useful; I'd also love to be able to use methods like Unmarshal without reimplementing all the struct tag reading logic). If we do want that, we would have to do this in the encoding/xml package to avoid an import loop (unless it stayed in the standard library but the business logic moved into an internal/xmlhelper package or something; I'm not sure whether that would work, I only just thought of it).

Alternatively, I've been thinking for a while now that it might make sense to deprecate encoding/xml and copy it to the subrepos. Maintaining a version there means we could fix some of the namespace and decoding bugs that are now part of the contract and can't be changed, take advantage of other things in the subrepos such as the text/transform package for escaping/unescaping XML, etc. We could do something like this there and keep it out of the standard library.

rsc commented Mar 29, 2017

I'm not interested in deprecating encoding/xml, but I'm also not interested in having it balloon into bigger things. Let's keep doing what we're doing, but let's leave transforms to third-party packages.

SamWhited commented:

Let's keep doing what we're doing, but let's leave transforms to third-party packages.

Alrighty; it would be nice to have some way to use these with existing Decoders, though (so that if you want to Unmarshal from a transformed stream you don't have to reimplement all of that logic yourself).

rsc commented Mar 29, 2017

Adding

type TokenReader interface {
    Token() (Token, error)
}

func NewTokenDecoder(r TokenReader) *Decoder

seems OK, provided it's not much code.
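
(For illustration, a minimal TokenReader under that interface might look like the following sketch; upperReader and its behavior are invented for the example:)

// upperReader implements the proposed TokenReader by upper-casing local
// element names from an inner *xml.Decoder.
type upperReader struct{ d *xml.Decoder }

func (u upperReader) Token() (xml.Token, error) {
	tok, err := u.d.Token()
	if err != nil {
		return nil, err
	}
	switch t := tok.(type) {
	case xml.StartElement:
		t.Name.Local = strings.ToUpper(t.Name.Local)
		return t, nil
	case xml.EndElement:
		t.Name.Local = strings.ToUpper(t.Name.Local)
		return t, nil
	}
	return tok, nil
}

// Usage: d := xml.NewTokenDecoder(upperReader{xml.NewDecoder(src)})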

SamWhited commented:

That makes sense to me (and is minimal). I've got a branch ready on my other machine with an implementation of NewTokenDecoder; I'll submit a CL later today.

SamWhited commented Mar 29, 2017

What behavior RawToken should have in this case is still unclear to me; otherwise, I've pushed up a CL that demonstrates the change (I will write more tests as soon as the behavior is settled).

EDIT: Gobot's slacking off today: https://go-review.googlesource.com/c/38791/

SamWhited commented:

Another interesting side effect of this proposal, which I hadn't considered originally, is that it allows the XML package to decode from any tokenizer that outputs XML tokens, even if the original input was not XML. It can be used to easily write "codecs" that translate other formats into XML at the token level, instead of wrapping an underlying io.Reader and operating at the byte level, or decoding an entire serialized object and then re-encoding it.

For instance, one could feed an XMPP library a special decoder that took CBOR (sent over the wire) as its input and converted it to XML tokens, or one that decoded the base profile of EXI (a binary compression format for XML) without the underlying XML library (encoding/xml) needing any EXI support at all.
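
As a sketch of that idea (all names here are illustrative), a TokenReader can synthesize XML tokens from input that contains no XML at all, for example by turning each line of plain text into an <item> element:

// linesReader synthesizes an XML token stream from plain text: each input
// line becomes <item>line</item>. No XML ever exists on the wire.
type linesReader struct {
	sc    *bufio.Scanner
	queue []xml.Token
}

func (r *linesReader) Token() (xml.Token, error) {
	// Drain any tokens queued from the previous line first.
	if len(r.queue) > 0 {
		tok := r.queue[0]
		r.queue = r.queue[1:]
		return tok, nil
	}
	if !r.sc.Scan() {
		if err := r.sc.Err(); err != nil {
			return nil, err
		}
		return nil, io.EOF
	}
	name := xml.Name{Local: "item"}
	// Queue the character data and end tag; return the start tag now.
	r.queue = []xml.Token{
		xml.CharData(r.sc.Text()),
		xml.EndElement{Name: name},
	}
	return xml.StartElement{Name: name}, nil
}

// Usage: d := xml.NewTokenDecoder(&linesReader{sc: bufio.NewScanner(src)})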

SamWhited commented Apr 2, 2017

Another thought: maybe this should have a RawToken() method instead. That solves the problem of what to do with calls to RawToken, and it means that Token() can keep working exactly as it does today without any changes (since it calls RawToken under the hood). I don't think this makes TokenReader implementations any harder to write. The new API would be:

type TokenReader interface {
    RawToken() (Token, error)
}

func NewTokenDecoder(r TokenReader) *Decoder

EDIT: CL updated while I play with and think about this API.
EDIT: After playing with this a bit, it appears to make some transformations easier to implement, because you can do a more "direct" transformation without worrying about fixing namespaces or matching the semantics and checks of the Decoder's Token method.
EDIT: The only other real issue I ran into is that implementing an operation like RemoveElement is difficult now. I'm not sure where else Skip() error would be useful in the interface, and it's a bit annoying to have to proxy it in every implementation, but it does make at least that one operation much easier to implement.

rsc commented Apr 10, 2017

@SamWhited, what's the current status here? I see that CL 38791 defines:

// A TokenReader is anything that can decode a stream of XML tokens, including a
// Decoder.
type TokenReader interface {
	RawToken() (Token, error)
	Skip() error
}

I don't see why Skip is needed. The Decoder implementation of Skip just calls Token/RawToken repeatedly until it finds the closing tag it wants. Why can't any consumer of a TokenReader do the same?
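
(For concreteness, such a consumer-side skip is only a few lines; in this sketch, next stands for the reader's Token or RawToken method:)

// skip discards tokens until it passes the EndElement matching the
// StartElement that was just read, tracking nesting depth along the way.
func skip(next func() (xml.Token, error)) error {
	for depth := 1; depth > 0; {
		tok, err := next()
		if err != nil {
			return err
		}
		switch tok.(type) {
		case xml.StartElement:
			depth++
		case xml.EndElement:
			depth--
		}
	}
	return nil
}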

As for Token vs. RawToken, it seems like there are two main differences: (1) RawToken doesn't guarantee well-formed (properly matched) input, which I hope is not a concern here, and (2) RawToken doesn't have to do namespace parsing and expansion. That's the big question in my mind: if you're doing transformation you probably do want to see the full namespace info, right? So the input fed back into the Decoder is going to have expanded namespace info, which means it should be Token, not RawToken.

RawToken can just return an error for a Decoder constructed by NewTokenDecoder.

SamWhited commented Apr 10, 2017

I don't see why Skip is needed. The Decoder implementation of Skip just calls Token/RawToken repeatedly until it finds the closing tag it wants. Why can't any consumer of a TokenReader do the same?

That's fair; I suppose it's not that much extra code to reimplement, so it's probably worth leaving it out just to keep the interface small.

That's the big question in my mind: if you're doing transformation you probably do want to see the full name space info, right? So then the input back into the Decoder is going to have expanded name space info, so it should be Token not RawToken.

It does make things a little easier in some cases (no searching for the xmlns or xmlns:prefix attributes). I could go either way; the only reason I lean towards RawToken is the question of what to do with the RawToken method on Decoders that wrap a TokenReader. As you said:

RawToken can just return an error for a Decoder constructed by NewTokenDecoder.

While I suppose that's true, it feels poor to never be sure whether RawToken will work on any given Decoder (since once you've wrapped a tokenizer, there would be no way to tell a normal Decoder apart from a special transformer Decoder).

It doesn't technically break the Decoder API, since RawToken could always return an error, but before I would have assumed that an error from RawToken was a read error; now I have to assume it might not actually be an "error" per se, just a signal that RawToken isn't a method I should be using (even though it is a method on the struct). This doesn't feel great to me.

rsc commented Apr 10, 2017

What do the transformers you've written assume about the form of the name space data? That seems like the key consideration.

SamWhited commented Apr 10, 2017

What do the transformers you've written assume about the form of the name space data? That seems like the key consideration.

I've written them both ways (using Token and RawToken) to see what the API "felt like"; there wasn't much difference. The only exception is RemoveElement, which is harder without Skip (but, as you pointed out, probably not hard enough to justify making every other transformer proxy the Skip method).
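
(For reference, a Skip-free RemoveElement could reuse a consumer-side skip like the sketch above; this version assumes a Token-based TokenReader, and the elementRemover type is illustrative:)

// elementRemover drops whole elements (start tag through matching end
// tag) whenever f returns true for the element's StartElement.
type elementRemover struct {
	src TokenReader
	f   func(xml.StartElement) bool
}

func (er elementRemover) Token() (xml.Token, error) {
	for {
		tok, err := er.src.Token()
		if err != nil {
			return nil, err
		}
		// On a matching start tag, discard everything up to its end tag.
		if start, ok := tok.(xml.StartElement); ok && er.f(start) {
			if err := skip(er.src.Token); err != nil {
				return nil, err
			}
			continue
		}
		return tok, nil
	}
}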

gopherbot commented:

CL https://golang.org/cl/38791 mentions this issue.

rsc commented Jun 15, 2017

Apologies for ignoring this for a while. CL 38791 PS 6 is so very simple that, as long as you can confirm it suffices for the things you want to build outside the standard library, I'm certainly happy with it. We should leave the submit until Go 1.10 at this point. Please do build some real transformers with it to verify that it's the right level of abstraction, but it looks good to me. I will mark this approved.

rsc modified the milestones: Go1.10, Proposal (Jun 15, 2017)
rsc changed the title from "proposal: encoding/xml token stream transformation API" to "encoding/xml: add decode from TokenReader, to enable stream transformers" (Jun 15, 2017)
SamWhited commented Jun 15, 2017

Thanks @rsc; I've sort of been ignoring this one too while focusing on the more complex DOM-like API. I rebased the CL and pushed a copy of my transformers that use this patch here: https://bitbucket.org/mellium/xmlstream/branch/encoding_xml_decode_wrapper (my CI on Bitbucket is failing, but only because it doesn't have the patch; the transformers all appear to work the same locally).

The only odd one was the RemoveElement transformer, which required that I reimplement skip anyway, because using RawToken instead of Token breaks Skip somehow. I'm still a bit up in the air on whether using Token or RawToken in the transformers is better (Token makes the API easier to use; RawToken means we don't have to worry about what RawToken should do). I'm starting to lean back towards using Token and having RawToken emit the "untransformed" tokens (which sort of makes sense, but could also be confusing).

EDIT: I pushed another branch that works the same way, but uses the API as if the Token method were in the interface instead of the RawToken method (and made the corresponding changes to test it on the CL, these are not yet pushed): https://bitbucket.org/mellium/xmlstream/branch/encoding_xml_decode_wrapper_token

This is in fact a bit easier in that one case, since you can still use Skip. I'd still worry that it would feel broken to people who have to use RawToken heavily, though. The correct behavior here is eluding me.

EDIT: Pushed the change to use Token instead of RawToken again. I suspect I'm overthinking this. Token makes things marginally easier in many cases, and having RawToken simply ignore the transformation, if you still need to use it, sort of makes sense. It's just a vague feeling that ignoring RawToken will cause problems anyway; I thought I had a concrete example of why it was bad at one point, but I can't think of it anymore, so I've changed it back.
