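Browser Caching

Cache Classification

Front-end caches can be classified in two ways:

  • By cache location (memory cache, disk cache, Service Worker, and so on)
  • By invalidation strategy (Cache-Control, ETag, and so on)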

Related: https://juejin.cn/post/6844903747357769742 (this article has been copied around the web so much that it is unclear whether this is the original author).

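Classification by Cache Location

In Chrome DevTools, the Size column of the Network panel shows how each request was ultimately served: a size (some number of KB, MB, and so on) means a real network request was made; otherwise the column reads "from memory cache", "from disk cache", or "from ServiceWorker".

The lookup priority is as follows (searched from top to bottom; the first hit is returned, otherwise the search continues):

  1. Service Worker
  2. Memory Cache
  3. Disk Cache
  4. Network request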
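
The lookup order above can be sketched as a simple chain of stores. This is only an illustration: the layer names mirror the list, while `resolve`, `layers`, and `fake_fetch` are hypothetical names, and real browsers track far more state than a URL-to-body dictionary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for one cache layer: (name, {url: body}).
Layer = Tuple[str, Dict[str, bytes]]


def resolve(url: str, layers: List[Layer],
            fetch: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Return (source, body), checking each layer in priority order."""
    for name, store in layers:
        if url in store:
            return name, store[url]
    # No layer had the resource, so fall through to the network.
    return "network", fetch(url)


layers: List[Layer] = [
    ("service worker", {}),
    ("memory cache", {"/logo.png": b"png-bytes"}),
    ("disk cache", {"/logo.png": b"png-bytes", "/app.js": b"js-bytes"}),
]


def fake_fetch(url: str) -> bytes:
    """Assumed network stub used only for this sketch."""
    return b"from-network"


print(resolve("/logo.png", layers, fake_fetch)[0])  # memory cache
print(resolve("/app.js", layers, fake_fetch)[0])    # disk cache
print(resolve("/new.css", layers, fake_fetch)[0])   # network
```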

memory cache

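The memory cache lives in RAM (its counterpart, the disk cache, lives on disk). Following the usual operating-system logic, memory is read before disk. The disk cache is covered later because its priority is lower; this section deals with the memory cache first.

Almost every resource fetched over the network is added to the memory cache automatically. Because the number of resources is large and the browser's memory footprint cannot grow without bound, the memory cache can only ever be short-term storage. Normally the memory cache for a page view is invalidated when its tab is closed (freeing room for other tabs). In extreme cases, for example when a single page's cache takes up an enormous amount of memory, the oldest entries may be evicted before the tab is even closed.

As noted, almost every requested resource can enter the memory cache. Two sources deserve a closer look: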

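  1. preloader

    Readers familiar with how browsers process a page will know that the browser first requests the HTML and parses it. When it then encounters resources such as JS and CSS that must be parsed and executed, it spends CPU time doing so. In the old days (before roughly 2007), page loads ran serially: request a JS/CSS file, parse and execute it, request the next JS/CSS file, parse and execute that one, and so on. The network sits idle during parsing and execution, which leaves room for improvement: why not fetch the next resource (or batch of resources) while the current one is being parsed and executed?

    That is the preloader's job. There is no official standard for the preloader, so every browser behaves slightly differently. Some browsers, for example, also download the targets of @import in CSS or the poster of a <video>.

    Resources fetched by the preloader are placed in the memory cache for the later parse-and-execute steps to use.

  2. preload (only two letters away from "preloader"). This one is probably more familiar, for example <link rel="preload">. Resources preloaded explicitly this way are also placed in the memory cache.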

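The memory cache guarantees that two identical requests on one page (for example two <img> elements with the same src, or two <link> elements with the same href) result in at most one actual request, avoiding waste.

When matching an entry, however, the browser compares more than the exact URL: it also checks the resource type, the CORS mode, and so on. A resource cached as a script therefore cannot satisfy an image request, even when the two share the same src.

When reading from the memory cache, the browser is said to ignore header settings such as max-age=0 and no-cache. For example, if a page contains several images with the same src, they are still read from the memory cache even if they are marked as not cacheable. The reasoning is that the memory cache is short-lived, usually lasting for a single page view, while max-age=0 is generally interpreted as "do not reuse this on the next visit", so the two do not conflict. In my own test on Chrome v103, however, resources served with max-age=0 or no-cache did not come from the memory cache, although identical images on one page still did not trigger multiple requests.

If a site owner really does not want a resource to be cached at all, not even briefly, no-store is required. With that header present, even the memory cache does not store the resource, so nothing can be read back from it. (The second example later on illustrates this.)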
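
The matching rules described above can be sketched as a cache keyed on more than the URL. This is an assumption-laden model, not browser code: `MemoryKey` and `MemoryCache` are hypothetical names, and the model follows the claim that only no-store keeps a resource out of the memory cache.

```python
from typing import Dict, NamedTuple, Optional


class MemoryKey(NamedTuple):
    url: str
    resource_type: str  # e.g. "script" or "image"
    cors_mode: str      # e.g. "no-cors" or "cors"


class MemoryCache:
    """Toy model: entries match only when URL, type, and CORS mode all agree."""

    def __init__(self) -> None:
        self._entries: Dict[MemoryKey, bytes] = {}

    def put(self, key: MemoryKey, body: bytes, cache_control: str = "") -> None:
        directives = {d.strip().lower() for d in cache_control.split(",")}
        if "no-store" in directives:
            return  # no-store: never keep a copy, not even briefly
        # max-age=0 and no-cache are deliberately not checked in this model.
        self._entries[key] = body

    def get(self, key: MemoryKey) -> Optional[bytes]:
        return self._entries.get(key)


cache = MemoryCache()
cache.put(MemoryKey("/a.js", "script", "no-cors"), b"code", "max-age=0")
cache.put(MemoryKey("/secret", "image", "no-cors"), b"img", "no-store")

print(cache.get(MemoryKey("/a.js", "script", "no-cors")))    # b'code'
print(cache.get(MemoryKey("/a.js", "image", "no-cors")))     # None
print(cache.get(MemoryKey("/secret", "image", "no-cors")))   # None
```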

disk cache

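The disk cache, also called the HTTP cache, is stored on disk as the name suggests, so it is persistent and really exists in the file system. It also allows the same resource to be reused across sessions and even across sites, for example when two sites use the same image (note that recent browser versions partition the HTTP cache by site, which limits this kind of cross-site reuse).

The disk cache follows the HTTP header fields strictly to decide which resources may be cached and which may not, which are still usable and which are stale and must be requested again. On a hit the browser reads the resource from disk, which is slower than memory but still much faster than the network. The vast majority of cache hits come from the disk cache.

The caching fields in HTTP headers are discussed in detail later. Any persistent storage has to deal with growth, and the disk cache is no exception. When the browser cleans up automatically, some undocumented algorithm removes the "oldest" or "most likely stale" resources, so entries are evicted one at a time. Each browser identifies those resources with its own algorithm, which is probably one of the ways browsers differ from each other.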

Service Worker

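All of the caching strategies above, along with the store, read, and invalidate actions, are decided and carried out inside the browser. We can only set certain response header fields to advise it; we cannot operate the cache ourselves. As an analogy, think of depositing or withdrawing money at a bank: you can only tell the teller how much you want to deposit or withdraw, and after a series of records and procedures they put the money in the vault or take it out and hand it to you.

Service Worker gives us a more flexible and more direct way of working. In the bank analogy, we can now bypass the teller and walk up to a vault ourselves (a separate small vault, not the one mentioned above) and put money in or take it out. We get to choose which money to deposit (which files to cache), when to withdraw (route matching rules), and which money to take out (matching the cache and returning the entry). Real banks, of course, offer no such service.

The cache a Service Worker operates on is distinct from the browser's internal memory cache and disk cache. This separate "small vault" can be found in Chrome DevTools (F12) under Application -> Cache Storage. Besides living in a different place, this cache is persistent: it is still there after the tab or the browser is closed and reopened (unlike the memory cache). Entries are removed in only two cases: an explicit call to the API cache.delete(resource), or the storage quota being exceeded, in which case the browser clears the whole cache.

If the Service Worker misses its cache, it normally falls back to fetch() to obtain the resource. At that point the browser moves on to the memory cache or disk cache for the next lookup. Note that a resource obtained through the Service Worker's fetch() is labeled "from ServiceWorker" even if it missed the Service Worker cache and actually went over the network. The third example later on illustrates this.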

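Network Request

If none of the three locations above yields a cached copy, the browser sends a real network request to fetch the content. To improve the hit rate of later requests, the resource is then added to the caches. Specifically:

  1. The handler in the Service Worker decides whether to store it in Cache Storage (the extra cache location).
  2. The relevant HTTP header fields (Cache-Control, Pragma, and so on) decide whether it is stored in the disk cache.
  3. The memory cache keeps a reference to the resource for the next use.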

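Pragma is a general header defined in HTTP/1.0. Its effect is implementation-dependent, so it may behave differently along the request-response chain. It exists for backward compatibility with HTTP/1.0-only caches, from a time when the HTTP/1.1 Cache-Control header did not yet exist.

Note: because the behavior of Pragma in HTTP responses is not specified, it is not a reliable replacement for the general HTTP/1.1 Cache-Control header, although in a request without Cache-Control it behaves like Cache-Control: no-cache. Use Pragma only where compatibility with HTTP/1.0 clients is required.

Whether the browser caches a page and its resources is generally determined by HTTP response headers set by the server. There are two pairs of related headers: Expires and Last-Modified from HTTP/1.0, and Cache-Control and ETag added in HTTP/1.1.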

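Classification by Invalidation Strategy

The memory cache is an optimization the browser performs on its own to speed up cache reads; it is outside the developer's control and is not bound by HTTP headers, so it is effectively a black box. Service Worker is an extra script written by the developer, with its own cache location; it arrived relatively late and is not yet very widely used. The cache we know best in day-to-day work is therefore the disk cache, also called the HTTP cache (because, unlike the memory cache, it obeys the fields in HTTP headers). The familiar terms strong (forced) caching, negotiated (comparison) caching, Cache-Control, and so on all belong to this category.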

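Strong Caching

With strong caching (also called forced caching), the client first consults its cache store when it makes a request. If an unexpired entry exists, it is returned directly; if not, the real server is contacted and the response is written into the cache store.

Strong caching reduces the number of requests outright, so it brings the largest gain of any caching strategy. It optimizes all three steps of fetching data mentioned at the beginning of the article. When considering caching as a way to improve page performance, strong caching should be the first option considered.

The fields that produce strong caching are Cache-Control and Expires.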
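
As a rough sketch of the strong-cache decision, the function below derives a freshness lifetime from Cache-Control: max-age or, failing that, from Expires, and compares it with the time elapsed since the response was stored. The names `freshness_lifetime` and `is_fresh` are hypothetical, and the model ignores Age, s-maxage, and every other directive.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def freshness_lifetime(headers: Dict[str, str],
                       response_time: datetime) -> Optional[timedelta]:
    """Return the explicit freshness lifetime, or None if the headers give none."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))  # max-age wins over Expires
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"]) - response_time
    return None


def is_fresh(headers: Dict[str, str], response_time: datetime,
             now: datetime) -> bool:
    """True if the stored response can be served without contacting the server."""
    lifetime = freshness_lifetime(headers, response_time)
    return lifetime is not None and now - response_time < lifetime


stored_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
headers = {"Cache-Control": "max-age=60"}
print(is_fresh(headers, stored_at, stored_at + timedelta(seconds=30)))  # True
print(is_fresh(headers, stored_at, stored_at + timedelta(seconds=90)))  # False
```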

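Negotiated Caching

When the strong cache has expired (the allotted time has passed), negotiated caching (also called comparison caching) takes over and the server decides whether the cached content is still valid.

In terms of flow, the browser first queries its cache store, which returns a cache validator. The browser then sends this validator to the server. If the cache is still valid, the server returns HTTP status 304 and the client keeps using its cached copy; if not, the server returns the new data and new caching rules, and the browser writes them into the cache store after handling the response.

Negotiated caching makes just as many requests as having no cache, but a 304 response carries only a status line and headers, with no file content, so the saving is in response body size. It optimizes only the last of the three steps of fetching data mentioned at the beginning of the article, the response: a smaller body shortens network transfer time. The gain is smaller than with strong caching, but still better than no cache at all.

Negotiated caching can be combined with strong caching as a fallback once the strong cache expires, and in real projects the two do usually appear together.

Negotiated caching relies on two pairs of fields (pairs, not two individual fields):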

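  • Last-Modified & If-Modified-Since

    1. The server uses the Last-Modified field to tell the client when the resource was last modified, for example
      Last-Modified: Sat, 10 Nov 2018 09:10:11 GMT
    2. The browser records this value in its cache store together with the content.
    3. The next time the same resource is requested, the browser finds a cached entry that "may or may not have expired", so it writes the stored Last-Modified value into the If-Modified-Since request header.
    4. The server compares the If-Modified-Since value with the resource's Last-Modified value. If they are equal, the resource is unmodified and the server responds with 304; otherwise it has been modified and the server responds with 200 and the data. This scheme has some shortcomings:
      • If the resource changes more than once per second, the cache cannot be relied on, because the smallest time unit is one second.
      • If the file is generated dynamically by the server, its modification time is always the generation time even when the content has not changed, so the cache has no effect.
  • ETag & If-None-Match

    • To address these problems, a new pair of fields was introduced: ETag and If-None-Match.
    • The ETag is a special identifier for the file (usually generated from a hash), and the server keeps the ETag of each file. The rest of the flow is the same as for Last-Modified, except that the Last-Modified field and its modification time are replaced by the ETag field and its file hash, and If-Modified-Since becomes If-None-Match. The server performs the same comparison: a match returns 304, a mismatch returns the new resource with 200.

ETag takes precedence over Last-Modified.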
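
The server side of this negotiation can be sketched as follows. This is a minimal model under stated assumptions: `make_etag` and `respond` are hypothetical names, the ETag is derived from a truncated SHA-256 hash (one common choice, not something the protocol requires), and weak validators are not handled.

```python
import hashlib
from email.utils import parsedate_to_datetime
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from a content hash (an assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: Dict[str, str], body: bytes,
            last_modified: str) -> Tuple[int, bytes]:
    """Return (status, body): 304 with no body when the client's copy is current."""
    etag = make_etag(body)
    if "If-None-Match" in request_headers:
        # The ETag comparison takes precedence over the date comparison.
        candidates = [t.strip() for t in request_headers["If-None-Match"].split(",")]
        matched = "*" in candidates or etag in candidates
        return (304, b"") if matched else (200, body)
    if "If-Modified-Since" in request_headers:
        since = parsedate_to_datetime(request_headers["If-Modified-Since"])
        if parsedate_to_datetime(last_modified) <= since:
            return 304, b""
    return 200, body


body = b"hello"
modified = "Sat, 10 Nov 2018 09:10:11 GMT"
print(respond({"If-None-Match": make_etag(body)}, body, modified)[0])  # 304
print(respond({"If-None-Match": '"old"'}, body, modified)[0])          # 200
print(respond({"If-Modified-Since": modified}, body, modified)[0])     # 304
```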

The following table lists the standard Cache-Control directives:

Request            Response
max-age            max-age
max-stale          -
min-fresh          -
-                  s-maxage
no-cache           no-cache
no-store           no-store
no-transform       no-transform
only-if-cached     -
-                  must-revalidate
-                  proxy-revalidate
-                  must-understand
-                  private
-                  public
-                  immutable
-                  stale-while-revalidate
stale-if-error     stale-if-error

Note: Check the compatibility table for their support; user agents that don't recognize them should ignore them.

This section lists directives that affect caching — both response directives and request directives.
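
Since the directives travel as a comma-separated list, a small parser is a convenient way to experiment with them. The sketch below uses a hypothetical `parse_cache_control` helper and makes a simplifying assumption: it does not handle quoted arguments that themselves contain commas.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directive names are case-insensitive, so normalize to lowercase.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


print(parse_cache_control("public, max-age=604800, immutable"))
# {'public': None, 'max-age': '604800', 'immutable': None}
```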

max-age

The max-age=N response directive indicates that the response remains fresh until N seconds after the response is generated.

Cache-Control: max-age=604800

Indicates that caches can store this response and reuse it for subsequent requests while it's fresh.

Note that max-age is not the elapsed time since the response was received; it is the elapsed time since the response was generated on the origin server. So if the other cache(s) — on the network route taken by the response — store the response for 100 seconds (indicated using the Age response header field), the browser cache would deduct 100 seconds from its freshness lifetime.

Cache-Control: max-age=604800
Age: 100
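
The deduction described above is plain subtraction. A tiny sketch, with `remaining_freshness` as a hypothetical helper name:

```python
def remaining_freshness(max_age: int, age: int) -> int:
    """Seconds of freshness left after deducting time spent in upstream caches."""
    return max(0, max_age - age)


print(remaining_freshness(604800, 100))  # 604700
```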

s-maxage

The s-maxage response directive also indicates how long the response is fresh for (similar to max-age) — but it is specific to shared caches, and they will ignore max-age when it is present.

Cache-Control: s-maxage=604800

no-cache

The no-cache response directive indicates that the response can be stored in caches, but the response must be validated with the origin server before each reuse, even when the cache is disconnected from the origin server.

Cache-Control: no-cache

If you want caches to always check for content updates while reusing stored content, no-cache is the directive to use. It does this by requiring caches to revalidate each request with the origin server.

Note that no-cache does not mean "don't cache". no-cache allows caches to store a response but requires them to revalidate it before reuse. If the sense of "don't cache" that you want is actually "don't store", then no-store is the directive to use.

must-revalidate

The must-revalidate response directive indicates that the response can be stored in caches and can be reused while fresh. If the response becomes stale, it must be validated with the origin server before reuse.

Typically, must-revalidate is used with max-age.

Cache-Control: max-age=604800, must-revalidate

HTTP allows caches to reuse stale responses when they are disconnected from the origin server. must-revalidate is a way to prevent this from happening - either the stored response is revalidated with the origin server or a 504 (Gateway Timeout) response is generated.

proxy-revalidate

The proxy-revalidate response directive is the equivalent of must-revalidate, but specifically for shared caches only.

no-store

The no-store response directive indicates that any caches of any kind (private or shared) should not store this response.

Cache-Control: no-store

private

The private response directive indicates that the response can be stored only in a private cache (e.g. local caches in browsers).

Cache-Control: private

You should add the private directive for user-personalized content, especially for responses received after login and for sessions managed via cookies.

If you forget to add private to a response with personalized content, then that response can be stored in a shared cache and end up being reused for multiple users, which can cause personal information to leak.

public

The public response directive indicates that the response can be stored in a shared cache. Responses for requests with Authorization header fields must not be stored in a shared cache; however, the public directive will cause such responses to be stored in a shared cache.

Cache-Control: public

In general, when pages are under Basic Auth or Digest Auth, the browser sends requests with the Authorization header. This means that the response is access-controlled for restricted users (who have accounts), and it's fundamentally not shared-cacheable, even if it has max-age.

You can use the public directive to unlock that restriction.

Cache-Control: public, max-age=604800

Note that s-maxage or must-revalidate also unlock that restriction.

If a request doesn't have an Authorization header, or you are already using s-maxage or must-revalidate in the response, then you don't need to use public.

must-understand

The must-understand response directive indicates that a cache should store the response only if it understands the requirements for caching based on status code.

must-understand should be coupled with no-store for fallback behavior.

Cache-Control: must-understand, no-store

If a cache doesn't support must-understand, it will be ignored. If no-store is also present, the response isn't stored.

If a cache supports must-understand, it stores the response with an understanding of cache requirements based on its status code.

no-transform

Some intermediaries transform content for various reasons. For example, some convert images to reduce transfer size. In some cases, this is undesirable for the content provider.

no-transform indicates that any intermediary (regardless of whether it implements a cache) shouldn't transform the response contents.

Note: Google's Web Light is one kind of such an intermediary. It converts images to minimize data for a cache store or slow connection and supports no-transform as an opt-out option.

immutable

The immutable response directive indicates that the response will not be updated while it's fresh.

Cache-Control: public, max-age=604800, immutable

A modern best practice for static resources is to include version/hashes in their URLs, while never modifying the resources — but instead, when necessary, updating the resources with newer versions that have new version-numbers/hashes, so that their URLs are different. That's called the cache-busting pattern.

<script src=https://example.com/react.0.0.0.js></script>

When a user reloads the browser, the browser will send conditional requests for validating to the origin server. But it's not necessary to revalidate those kinds of static resources even when a user reloads the browser, because they're never modified. immutable tells a cache that the response is immutable while it's fresh and avoids those kinds of unnecessary conditional requests to the server.

When you use a cache-busting pattern for resources and apply them to a long max-age, you can also add immutable to avoid revalidation.

stale-while-revalidate

The stale-while-revalidate response directive indicates that the cache can reuse a stale response while it revalidates that response with the origin server in the background.

Cache-Control: max-age=604800, stale-while-revalidate=86400

In the example above, the response is fresh for 7 days (604800s). After 7 days it becomes stale, but the cache is allowed to reuse it for any requests that are made in the following day (86400s), provided that they revalidate the response in the background.

Revalidation will make the cache be fresh again, so it appears to clients that it was always fresh during that period — effectively hiding the latency penalty of revalidation from them.

If no request happens during that period, the stored response simply becomes stale and the next request revalidates it normally.
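
The three phases in this example can be sketched as a function of the response's age. `swr_state` is a hypothetical name and the model ignores every other directive:

```python
def swr_state(age: int, max_age: int, swr: int) -> str:
    """Classify a stored response by its age in seconds."""
    if age < max_age:
        return "fresh"                   # serve from cache, no request needed
    if age < max_age + swr:
        return "stale-while-revalidate"  # serve stale copy, revalidate in background
    return "stale"                       # must revalidate before serving


DAY = 86400
print(swr_state(3 * DAY, 604800, 86400))      # fresh
print(swr_state(7 * DAY + 60, 604800, 86400)) # stale-while-revalidate
print(swr_state(9 * DAY, 604800, 86400))      # stale
```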

stale-if-error

The stale-if-error response directive indicates that the cache can reuse a stale response when an origin server responds with an error (500, 502, 503, or 504).

Cache-Control: max-age=604800, stale-if-error=86400

In the example above, the response is fresh for 7 days (604800s). After 7 days it becomes stale, but it can be used for an extra 1 day (86400s) if the server responds with an error.

Once that extra period has passed, the stored response is simply stale. From then on the client receives the error response as-is if the origin server sends one.
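
A minimal sketch of that decision, assuming the error statuses listed above and a hypothetical helper name `serve_stale_on_error`:

```python
ERROR_STATUSES = {500, 502, 503, 504}


def serve_stale_on_error(age: int, max_age: int, stale_if_error: int,
                         origin_status: int) -> bool:
    """True if a stale stored response may be served in place of the origin's error."""
    is_stale = age >= max_age
    within_window = age < max_age + stale_if_error
    return is_stale and within_window and origin_status in ERROR_STATUSES


print(serve_stale_on_error(604800 + 3600, 604800, 86400, 503))   # True
print(serve_stale_on_error(604800 + 90000, 604800, 86400, 503))  # False
```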

The no-cache request directive asks caches to validate the response with the origin server before reuse.

Cache-Control: no-cache

no-cache allows clients to request the most up-to-date response even if the cache has a fresh response.

Browsers usually add no-cache to requests when users are force reloading a page.

The no-store request directive allows a client to request that caches refrain from storing the request and corresponding response — even if the origin server's response could be stored.

Cache-Control: no-store

Note that the major browsers do not support requests with no-store.

The max-age=N request directive indicates that the client allows a stored response that is generated on the origin server within N seconds — where N may be any non-negative integer (including 0).

Cache-Control: max-age=3600

In the case above, if the response with Cache-Control: max-age=604800 was generated more than 1 hour ago (calculated from max-age and the Age header), the cache couldn't reuse that response.

Many browsers use this directive for reloading, as explained below.

Cache-Control: max-age=0

max-age=0 is a workaround for no-cache, because many old (HTTP/1.0) cache implementations don't support no-cache. Browsers today still send max-age=0 on a normal reload, for backward compatibility, and send no-cache on a force reload.

If the max-age value isn't non-negative (for example, -1) or isn't an integer (for example, 3599.99), then the caching behavior is undefined. However, the Calculating Freshness Lifetime section of the HTTP specification states:

Caches are encouraged to consider responses that have invalid freshness information to be stale.

In other words, for any max-age value that isn't an integer or isn't non-negative, the caching behavior that's encouraged is to treat the value as if it were 0.

The max-stale=N request directive indicates that the client allows a stored response that is stale within N seconds.

Cache-Control: max-stale=3600

In the case above, if the response with Cache-Control: max-age=604800 became stale more than 1 hour ago (that is, it was generated more than 7 days and 1 hour ago), the cache couldn't reuse that response.

Clients can use this header when the origin server is down or too slow and can accept cached responses from caches even if they are a bit old.

Note that the major browsers do not support requests with max-stale.

The min-fresh=N request directive indicates that the client allows a stored response that is fresh for at least N seconds.

Cache-Control: min-fresh=600

In the case above, if the response with Cache-Control: max-age=3600 was stored in caches 51 minutes ago, the cache couldn't reuse that response.

Clients can use this header when the user requires the response to not only be fresh, but also requires that it won't be updated for a period of time.

Note that the major browsers do not support requests with min-fresh.
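
The three request directives above (max-age, max-stale, min-fresh) can be combined into one reuse check. This is a sketch with a hypothetical function name, `cache_may_reuse`; `age` and `lifetime` describe the stored response in seconds, and the remaining parameters come from the request.

```python
from typing import Optional


def cache_may_reuse(age: int, lifetime: int,
                    max_age: Optional[int] = None,
                    max_stale: Optional[int] = None,
                    min_fresh: Optional[int] = None) -> bool:
    """Decide whether a stored response satisfies the request directives."""
    if max_age is not None and age > max_age:
        return False                 # older than the client accepts
    remaining = lifetime - age       # zero or negative once the response is stale
    if min_fresh is not None and remaining < min_fresh:
        return False                 # not fresh for long enough
    if remaining <= 0:
        # Stale responses are acceptable only within the max-stale allowance.
        return max_stale is not None and -remaining <= max_stale
    return True


# min-fresh=600 against max-age=3600 stored 51 minutes ago: 9 minutes remain.
print(cache_may_reuse(51 * 60, 3600, min_fresh=600))             # False
# max-age=3600 request against a response generated 3 hours ago.
print(cache_may_reuse(3 * 3600, 604800, max_age=3600))           # False
# max-stale=3600 against a response that became stale 30 minutes ago.
print(cache_may_reuse(604800 + 1800, 604800, max_stale=3600))    # True
```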

The no-transform request directive has the same meaning that no-transform has for a response, but applies to the request instead.

With the only-if-cached request directive, the client indicates that the cache should return an already-cached response. If a cache has stored a response, it's reused.

If you don't want a response stored in caches, use the no-store directive.

Cache-Control: no-store

Note that no-cache means "it can be stored but don't reuse before validating" — so it's not for preventing a response from being stored.

Cache-Control: no-cache

In theory, if directives conflict, the most restrictive directive should be honored. So the example below is basically meaningless because private, no-cache, max-age=0 and must-revalidate conflict with no-store.

# conflicted
Cache-Control: private, no-cache, no-store, max-age=0, must-revalidate

# equivalent to
Cache-Control: no-store

For content that's generated dynamically, or that's static but updated often, you want a user to always receive the most up-to-date version.

If you don't add a Cache-Control header because the response is not intended to be cached, that could cause an unexpected result. Cache storage is allowed to cache it heuristically — so if you have any requirements on caching, you should always indicate them explicitly, in the Cache-Control header.

Adding no-cache to the response causes revalidation to the server, so you can serve a fresh response every time — or if the client already has a new one, just respond 304 Not Modified.

Cache-Control: no-cache

Most HTTP/1.0 caches don't support no-cache directives, so historically max-age=0 was used as a workaround. But max-age=0 on its own could allow a stale response to be reused when caches are disconnected from the origin server. must-revalidate addresses that. That's why the example below is equivalent to no-cache.

Cache-Control: max-age=0, must-revalidate

But for now, you can simply use no-cache instead.

Unfortunately, there are no cache directives for clearing already-stored responses from caches.

Imagine that clients or caches store a fresh response for a path and therefore send no request to the server for it. There is nothing the server can do about that path.

Alternatively, Clear-Site-Data can clear a browser cache for a site. But be careful: that clears every stored response for a site — and only in browsers, not for a shared cache.

Heuristic Caching

https://stackoverflow.com/a/27972908

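If a response sets neither Expires nor Cache-Control but does carry Last-Modified, the browser falls back to a default caching policy with a freshness lifetime of (Date - Last-Modified) * 0.1, where Date is the time the response was generated. This is heuristic caching. (PS: responses that carry an ETag are also subject to heuristic caching, which acts as a strong cache. On a normal refresh, however, at least in recent Chrome versions, the browser adds cache-control: max-age=0 to the request for the current URL, so index.html does not hit the strong cache on a normal refresh and uses negotiated caching where it can. Because refreshing interferes in this way, the problem shows up only intermittently and is hard to track down.)

Heuristic caching is a form of strong caching: until the entry expires, no HTTP request is made.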

No explicit HTTP Cache Lifetime information was provided. Heuristic expiration policies suggest defaulting to: 10% of the delta between Last-Modified and Date.
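
The 10% rule quoted above works out as follows. This is a sketch with a hypothetical helper name, `heuristic_lifetime`; browsers are free to apply their own heuristics and upper bounds.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def heuristic_lifetime(date_header: str, last_modified: str) -> timedelta:
    """Return 10% of the interval between Last-Modified and Date."""
    delta = parsedate_to_datetime(date_header) - parsedate_to_datetime(last_modified)
    # A Last-Modified in the future gives no usable lifetime, so clamp at zero.
    return max(delta, timedelta(0)) / 10


# A file last modified 10 days before it was served is treated as fresh for 1 day.
print(heuristic_lifetime("Thu, 11 Jan 2024 00:00:00 GMT",
                         "Mon, 01 Jan 2024 00:00:00 GMT"))  # 1 day, 0:00:00
```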

Fix: Cache-Control: no-cache

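This is admittedly not a perfect fix, because of the memory cache: if a page stays open for a long time and the file name has not changed, the resource is read straight from memory and the user keeps seeing the old version. In Chrome v103, however, no-cache does resolve this memory cache problem.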