New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
io/fs, net/http: define interface for automatic ETag serving #60940
Comments
I have a hacky implementation available here: |
Thinking out loud (sorry for the noise), it seems even better to add an optional method to
Updated proposal:First, in io/fs, define // A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
fs.FileInfo
// FileHashes returns content hashes of the file that uniquely
// identifies the file contents.
// The returned hashes should be of the form algorithm-base64,
// and implementations are encouraged to use sha256, sha384, or sha512
// as the algorithms and RawStdEncoding as base64 encoding,
// for interoperability with other systems.
//
// FileHashes must NOT compute any hash of the file during the call.
// That is, it must run in time O(1) not O(length of file).
// If no content hash is already available, FileHashes should
// return nil rather than take the time to compute one.
FileHashes() []string
} Second, in net/http, when serving a File (in func setEtag(w ResponseWriter, fi fs.FileInfo) {
if ch, ok := fi.(FileHashesInfo); ok {
if w.Header().Get("Etag") != "" {
return
}
for _, h := range ch.FileHashes() {
// TODO: skip the hash if unsuitable (define "suitable")
// TODO: should the etag be weak or strong?
w.Header().Set("Etag", `W/"`+h+`"`)
break
}
}
} Third (probably out of scope for this proposal), add the |
This proposal has been added to the active column of the proposals project |
To summarize the proposal above:
I think we can probably improve on a few of these decisions.
|
Thanks for the feedback!
Logically, I would put the Trying to be more concrete, I was able to find 3 implementations of fs.FS in the stdlib:
So the first case dos not really influence the decision. For cases outside of the stdlib, I looked up S3 implementations and found that the hashes were returned when GETting the file or requesting the HEAD (so adding it to the FileInfo would mirror both ways of accessing the hashes, while attaching to
I would drop the
I really like your
My example code only sets the ETAg once. I think this should be sufficient. However to work fine, the implementer should:
PS: do you think that dropping a comment in the previous proposal would be a good idea, to gather more feedback? |
Updated proposal, taking into accounts the comments above: // ContentHashesInfo provides pre-computed hashes of the file contents.
type ContentHashesInfo interface {
FileInfo
// ContentHashes returns pre-computed hashes of the file contents.
//
// ContentHashes must NOT compute any hash of the file during the call.
// That is, it must run in time O(1) not O(length of file).
// If no content hash is already available, ContentHashes should
// return nil rather than take the time to compute one.
//
// The order of the returned hash must be stable.
ContentHashes() []Hash
}
// Hash indicates the hash of a given content.
type Hash struct {
// Algorithm indicates the algorithm used. Implementations are encouraged
// to use package-like name for interoperability with other systems
// (lowercase, without dash: e.g. sha256, sha1, crc32)
Algorithm string
// Sum is the result of the hash, it should not be modified.
Sum []byte
} I have created a new |
I'm still on the fence about FileInfo vs File, but I'm willing to try FileInfo and see how it goes. It seems like we are at:
The remaining question in my reply above is (4), namely what does HTTP do when Hash returns multiple hashes? As far as I can tell it makes no sense to send back multiple ETag headers. |
I would suggest to use the first suitable hash. For instance taking the first one with at least 32 bits (and truncating it to 512 bits): if w.Header().Get("Etag") != "" {
return
}
const minLen, maxLen = 4, 64
for _, h := range ch.ContentHashes() {
buf := h.Sum
if len(buf) < minLen {
// hash should have at least 32 bits
continue
}
if len(buf) > maxLen {
buf = buf[:maxLen]
}
// Strong etag: any encoding middleware should set it to weak.
w.Header().Set("Etag", `"`+base64.RawStdEncoding.EncodeToString(buf)+`"`)
break
} |
Nit: Hash() returns more than one Hash. Hashes()? |
It seems fine for Hash to return []Hash. It doesn't have to be Hashes. Have all remaining concerns about this proposal been addressed? |
Based on the discussion above, this proposal seems like a likely accept. |
No change in consensus, so accepted. 🎉 The proposal details are as follows. In io/fs, we add:
Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum). In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”. |
@oliverpool If you're interested in working on this, feel free to send a patch |
Should there be a package function in fs like // Hashes returns fi.Hash() if fi implements HashFileInfo. Otherwise it returns nil.
func Hashes(fi FileInfo) []Hash {
if hfi, ok := fi.(HashFileInfo); ok {
return hfi.Hash()
}
return nil
} |
Change https://go.dev/cl/562875 mentions this issue: |
@earthboundkid I don't think that there is a need for it. However if usage shows that there is interest in such a package function, it can be added later. |
Enables browser caching. Calculate & cache hashes on demand. Remove if golang/go#60940 is ever complete.
A couple points I observed while reviewing https://go.dev/cl/562875, copying here for visibility: A zip file's hash is a CRC32. 32 bits is super small. Do we want to automatically serve Etags headers with hashes that small? If we set a larger minimum (128 bits, maybe?), do we want to bother having zip entries implement
|
I would argue that ETag should only be used for integrity check (not for security) and that in this case 32 bits should be fine (however note that if a stronger algorithm is available earlier in the list, it will be selected instead of CRC32).
I am opened to both (in the initial implementation, I tried to keep the changes as small as possible).
This follows the latest comment from the proposal review group, however the algorithm name could be dropped if wanted. |
Aren't ETags used to identify when a cache needs reloading? If you get a duplicate tag, the user is left with a stale entry. I believe 32 bits gets you a 1% probability of a collision with only ~1000 entries, ~75% chance with 100000. That seems pretty high. |
ETags are used in combination with the URL. So the only issue is if the same URL gets the same hash (which probability should be quite low). Strictly speaking it is associated with a given Source: https://www.rfc-editor.org/rfc/rfc7232#section-2.3 (emphasis mine):
Furthermore MDN gives an example (emphasis mine):
I am pretty sure that browsers associate the ETag value with a given URL. |
Yea, but 1% probability of collision on 1000 versions of a resource for the same URL can lead to pretty bad UX once you hit that 1%. Revision number is (hopefully) monotonic/incremental and never repeats and thus unique for different content. Maybe API should support weak etags and crc32 would then be a weak etag? From RC 7232:
Maybe the proposal should be changed so that On the other hand, it seems Google Cloud accepts crc32 and even recommends it:
BTW, I just noticed (and noticing that it might be bike shedding):
Why not URL encoding? I thought URL encoding is the norm for Etags because it leads to possibly less confusion with the potential
Do you have any reference to other systems which use |
Then you should probably use something else that zip files. If your
Providing it as a byte slice allows the consumer to encode it as hex if needed.
The Edit: note that SRI uses
The
I will update the first comment. |
Thanks for the replies. I understand now. I think
I think we should just go with full sha256. Then this information can also be use for SRI. |
I have just noticed that my implementation exposes a |
What does this mean? Maybe "returned hashes"? What does it mean to be constant? Between multiple calls to it? |
This is an integrity check: You provide some data and its expected hash, and GCS returns an error if the data doesn't match the hash. The expectation is that the data and hash match unless something has gone wrong. The birthday paradox doesn't apply. ETags (so far as I can tell) are used for cache validation: The expectation is that the data for a resource will change over time, and you use the ETag to tell when it has done so. In this case, the birthday paradox does apply.
Are there not URLs where the data changes frequently? If a URL has data which changes once per second, I believe you'd have a ~50% chance of a 32-bit hash collision happening in a day. I'm not at all familiar with ETag usage in practice, but looking at various sources of documentation I can't seem to find any examples of ETags containing hashes this short. GCS (https://cloud.google.com/storage/docs/hashes-etags#etags) returns an MD5 under certain circumstances, and an undefined value in others. (I'm going to guess that they started with MD5, people came to depend on it, and now they can't change it to something better.) https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag has an example with a 160-bit value encoded as a hex string. The use case of serving files directly out of a |
Regarding the hash provided for I don't know anything about the history of the implementation of embedded files, so I did a little bit of code archaeology. I'm not clear on why we store a per-file hash at all, but the hash is computed by I can't figure out why there's a hash in there; the I think deriving a "half-sha256" from this hash value is not a good idea, since this is specifically an arbitrary hash. Storing an actual SHA-256 for embedded files may be possible, but https://go.dev/cl/402594 (which introduced I see two possible paths:
|
Regarding CRC32, apparently express used it until mid-2014 for etags: expressjs/express#1435
Edit: as noted by @AGWA this would not be conformant with the RFC 7232 express made this mistake as well Regarding embed, I think having a full sha-256 would be much more useful:
|
Weak ETags are meant for cases where two resources aren't byte-for-byte equal, but it's still OK to consider them equivalent for caching (e.g. they're simply encoded differently, or the only differences are semantically unimportant). See the entirety of RFC 7232 Section 2.1 for details. It's not OK for two entirely different resources to have the same ETag, whether it's weak or strong. IMO, that makes CRC32 and other short hashes unsuitable for ETags. |
I recognize that we're almost certainly not going to try to redesign the API in the accepted proposal. With that said, I think there are a couple unfortunate points in I think I'd prefer something like:
|
@neild: I think the issue with your proposal is that it makes hash encoding to a string non-optional? Maybe |
Could we just document that returned slices should not be modified? |
@rsc wrote #43223 (comment) in 2021, regarding the previous (declined) proposal.
|
Returning a fresh slice every time seems bad for garbage collection. Returning the same slice seems like a recipe for spooky bugs. Time for |
To sum up the discussions that happened lately, I see 3 open questions:
My main question is: how can we reach a decision regarding those 3 points?
|
I propose that 128 bits is the minimum sufficient hash size for automatically serving an ETag header. 128 bits is more than enough to avoid birthday paradox collisions. I propose that rather than baking this limitation into I don't have a strong opinion on whether
|
Pretty sure it was put there specifically to support ETag. It's ready to be used when we are ready to use it. |
I am not fully convinced by this line of reasoning. The goal of having an exported |
I have updated the CL with @neild suggestions:
I am not sure that I would be able to change the hash of I would appreciate any feedback, to move this forward. |
Proposal committee, I'd appreciate your input here. This proposal is accepted, but there are some questions regarding it that need addressing. To summarize (see discussion above for details):
|
Isn't it maybe easier to make a generic So then you could have something like:
This could then work with any source of a file system, like |
Renewal of #43223
Here is a proposal which tries to address the concerns raised in #43223.
Accepted proposal
In io/fs, define
Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).
In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.
Original proposal (fs.File)
First, in io/fs, define
Second, in net/http, when serving a File (in
serveFile
, right beforeserveContent
for instance), if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.Third, add the
ContentHash
method onhttp.ioFile
file (as a proxy to thefs.File
ContentHash
method).Fourth (probably out of scope for this proposal), add the
ContentHash
method onembed.FS
files.This proposal fixes the following objections:
The caller will simply get all available implementations and can filter them out.
The implementers are encouraged to indicate the algorithm used for each hash.
This one does.
This implementation cannot return an error (the implementer choose to panic. Returning
nil
seems better suited).It is currently very cumbersome, since the middleware would need to open the file as well (which means having the exact same logic regarding URL cleanup as the http.FileServer). Here is my attempt: https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag/fileserver.go (even uglier, since I have use
reflect
to retrieve the underlyingfs.File
from thehttp.File
).Could a "github-collaborator" post a message in #43223 to notify the people who engaged in previous proposal of this updated proposal?
The text was updated successfully, but these errors were encountered: