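Prompt Caching Techniques for Reducing LLM Call Cost and Latency

Addresses the high cost and response latency of LLM applications: by combining Anthropic's native prompt caching, Redis response caching, and the CAG pattern, it removes repeated computation so that static context is not billed at the full input rate on every request.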

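Why You Need This Skill

When you call an API such as Claude, every request reprocesses the same prefix content, which wastes tokens. Anthropic's Prompt Caching lets developers cache stable content such as system prompts. On a cache hit, those tokens are not free, but they are billed at a fraction of the base input price (cache reads cost about 10% of the base rate, while the initial cache write carries a surcharge), and the response starts noticeably faster. For SaaS applications with a large knowledge base or frequently reused templates, this can cut the cost of the cached tokens by up to 90%.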
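
To make the savings concrete, here is a rough cost sketch. The multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x) are assumed from Anthropic's pricing for the default short-lived cache at the time of writing; the $3 per million tokens base price, the workload sizes, and the name `inputCostUSD` are illustrative assumptions, so check current pricing before relying on the numbers.

```typescript
// Assumed pricing multipliers for the default short-lived cache; verify before use.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

interface RequestTokens {
    uncached: number;    // input tokens processed normally
    cacheWrite: number;  // tokens written to the cache on this request
    cacheRead: number;   // tokens served from the cache on this request
}

// Input-side cost in USD, given a base price per million input tokens.
function inputCostUSD(tokens: RequestTokens, basePricePerMTok: number): number {
    const weighted =
        tokens.uncached +
        tokens.cacheWrite * CACHE_WRITE_MULTIPLIER +
        tokens.cacheRead * CACHE_READ_MULTIPLIER;
    return (weighted / 1_000_000) * basePricePerMTok;
}

// Hypothetical workload: a 50,000-token static prefix plus a 200-token question.
const BASE_PRICE = 3; // USD per million input tokens (assumed)
const noCache = inputCostUSD({ uncached: 50_200, cacheWrite: 0, cacheRead: 0 }, BASE_PRICE);
const firstCall = inputCostUSD({ uncached: 200, cacheWrite: 50_000, cacheRead: 0 }, BASE_PRICE);
const cachedCall = inputCostUSD({ uncached: 200, cacheWrite: 0, cacheRead: 50_000 }, BASE_PRICE);

console.log({ noCache, firstCall, cachedCall });
```

Under these assumptions the first call costs about 25% more than an uncached call, and each later call that hits the cache costs roughly a tenth as much on the input side. Output tokens are billed as usual.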

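When to Use It

Fixed system prompt: the system instructions or the base of the knowledge base stay unchanged for long periods, and only the user input varies.
High-frequency repeated queries: the same user questions come up again and again, which makes full response caching with Redis a good fit.
Replacing document retrieval: when the document set is stable and its total size fits in the context window, the CAG pattern can pre-cache the documents and replace the RAG retrieval step on each query.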
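
Whether a document set "fits in the context window" can be estimated before committing to CAG. The sketch below is only a rough check: the 4-characters-per-token ratio is a common heuristic for English text rather than a real tokenizer, and `fitsInContext`, the 200,000-token window, and the 8,000-token reserve are assumptions made for illustration.

```typescript
interface Doc {
    title: string;
    content: string;
}

// Very rough heuristic: about 4 characters per token for English text (assumption).
function estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
}

// True when the whole corpus, plus a reserve for the question and the answer,
// fits inside the context window.
function fitsInContext(docs: Doc[], contextWindow: number, reservedTokens: number): boolean {
    const corpusTokens = docs.reduce(
        (sum, d) => sum + estimateTokens(d.title) + estimateTokens(d.content),
        0,
    );
    return corpusTokens + reservedTokens <= contextWindow;
}

// Hypothetical corpus of two documents.
const sampleDocs: Doc[] = [
    { title: "Billing FAQ", content: "x".repeat(40_000) },
    { title: "API guide", content: "y".repeat(120_000) },
];
const useCAG = fitsInContext(sampleDocs, 200_000, 8_000);
console.log(useCAG ? "CAG is feasible" : "fall back to RAG");
```

For a production decision, a real token count from the provider is more reliable than a character-based estimate, especially for non-English text.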

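Core Workflow

1. Implement Anthropic Prompt Caching

Mark static text blocks with cache_control: { type: "ephemeral" } so they are cached rather than reprocessed on every request. Two limits are worth knowing at the time of writing: a prefix is only cached if it is long enough (about 1,024 tokens for Sonnet-class models), and a cached entry expires after roughly five minutes without a hit.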

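import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();  // reads ANTHROPIC_API_KEY from the environment

// Placeholders: replace with your own static content
const LONG_SYSTEM_PROMPT = "<your detailed instructions>";
const KNOWLEDGE_BASE = "<your large static context>";

// Cache the stable parts of your prompt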
async function queryWithCaching(userQuery: string) {
    const response = await client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        system: [
            {
                type: "text",
                text: LONG_SYSTEM_PROMPT,  // Your detailed instructions
                cache_control: { type: "ephemeral" }  // Cache this!
            },
            {
                type: "text",
                text: KNOWLEDGE_BASE,  // Large static context
                cache_control: { type: "ephemeral" }
            }
        ],
        messages: [
            { role: "user", content: userQuery }  // Dynamic part
        ]
    });

    // Check cache usage
    console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
    console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);

    return response;
}
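
The two usage counters logged above can be turned into a hit ratio that is easy to track over time. This is a minimal sketch: the field names match the `usage` object used in the example, while `cacheHitRatio` and the sample numbers are hypothetical, and it assumes that `input_tokens` counts only the tokens that were neither read from nor written to the cache.

```typescript
// Subset of the `usage` object on a Messages API response.
interface CacheUsage {
    input_tokens: number;
    cache_read_input_tokens?: number | null;
    cache_creation_input_tokens?: number | null;
}

// Fraction of all input tokens that were served from the cache (0 when there is no input).
function cacheHitRatio(usage: CacheUsage): number {
    const read = usage.cache_read_input_tokens ?? 0;
    const written = usage.cache_creation_input_tokens ?? 0;
    const total = usage.input_tokens + read + written;
    return total === 0 ? 0 : read / total;
}

// Hypothetical usage values, for illustration only.
const coldCall = cacheHitRatio({ input_tokens: 120, cache_creation_input_tokens: 9_880 });
const warmCall = cacheHitRatio({ input_tokens: 120, cache_read_input_tokens: 9_880 });
console.log({ coldCall, warmCall });
```

A ratio that stays near zero across requests usually means the prefix is changing between calls, is below the minimum cacheable length, or is expiring between hits.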

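2. Build a Response Caching Layer

For user queries that are exactly identical, store past responses in Redis. Matching semantically similar queries additionally requires an embedding-based lookup, which is outside the scope of the exact-match example here. The hash key must cover the prompt and the model parameters (such as the model name and temperature); otherwise a response generated under different settings may be served by mistake.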

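import { createHash } from 'crypto';
import Redis from 'ioredis';

// Fall back to a local instance when REDIS_URL is not set
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Parameters that change the output and therefore belong in the cache key
interface CacheParams {
    model: string;
    temperature: number;
}

class ResponseCache {
    private ttl = 3600;  // seconds (1 hour default)

    // Exact match caching
    async getCached(prompt: string, params: CacheParams): Promise<string | null> {
        const key = this.hashKey(prompt, params);
        return await redis.get(`response:${key}`);
    }

    async setCached(prompt: string, params: CacheParams, response: string): Promise<void> {
        const key = this.hashKey(prompt, params);
        await redis.set(`response:${key}`, response, 'EX', this.ttl);
    }

    // Hash the prompt together with the model parameters
    private hashKey(prompt: string, params: CacheParams): string {
        const material = JSON.stringify([params.model, params.temperature, prompt]);
        return createHash('sha256').update(material).digest('hex');
    }
}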
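
A response cache is typically used in a cache-aside flow: look up first, call the model only on a miss, then store the result. The sketch below shows that flow under stated assumptions. `KVStore`, `MemoryStore`, and `getOrCompute` are hypothetical names, and the in-memory store stands in for Redis so the example runs without a server.

```typescript
// Minimal key-value interface; a Redis client can be adapted to it.
interface KVStore {
    get(key: string): Promise<string | null>;
    set(key: string, value: string): Promise<void>;
}

// In-memory stand-in for Redis (no TTL), used only to keep the sketch runnable.
class MemoryStore implements KVStore {
    private data = new Map<string, string>();
    async get(key: string): Promise<string | null> {
        return this.data.get(key) ?? null;
    }
    async set(key: string, value: string): Promise<void> {
        this.data.set(key, value);
    }
}

// Cache-aside: return the stored response if present, otherwise generate and store it.
async function getOrCompute(
    store: KVStore,
    key: string,
    generate: () => Promise<string>,
): Promise<{ value: string; hit: boolean }> {
    const cached = await store.get(key);
    if (cached !== null) return { value: cached, hit: true };
    const value = await generate();
    await store.set(key, value);
    return { value, hit: false };
}

async function demo(): Promise<void> {
    const store = new MemoryStore();
    let llmCalls = 0;
    const fakeLLM = async () => { llmCalls++; return "generated answer"; };
    const first = await getOrCompute(store, "response:abc", fakeLLM);
    const second = await getOrCompute(store, "response:abc", fakeLLM);
    console.log(first.hit, second.hit, llmCalls); // prints: false true 1
}
demo();
```

Passing the generator as a function keeps the caching logic independent of any particular model client, which also makes it straightforward to test.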

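3. Build a CAG System (Cache-Augmented Generation)

When the document set is stable, skip the RAG retrieval step on each query. Instead, format the documents once and send them as a cached part of the prompt. Manage cache invalidation with a version number or a content hash so the data stays fresh.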

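// Named SourceDocument to avoid clashing with the DOM's global Document type
interface SourceDocument {
    title: string;
    content: string;
}

class CAGSystem {
    private cachedContext: string | null = null;

    async buildCachedContext(documents: SourceDocument[]): Promise<void> {
        const formatted = documents.map(d =>
            `## ${d.title}\n${d.content}`
        ).join('\n\n');
        this.cachedContext = formatted;
    }

    async query(userQuery: string): Promise<string> {
        if (this.cachedContext === null) {
            throw new Error("Call buildCachedContext() before query()");
        }
        const response = await client.messages.create({
            model: "claude-sonnet-4-20250514",
            max_tokens: 1024,  // required by the Messages API
            system: [
                {
                    type: "text",
                    text: "You are a helpful assistant with access to the following documentation.",
                    cache_control: { type: "ephemeral" }
                },
                {
                    type: "text",
                    text: this.cachedContext,  // Pre-cached docs
                    cache_control: { type: "ephemeral" }
                }
            ],
            messages: [{ role: "user", content: userQuery }]
        });
        // content is a union of block types; return the first block only if it is text
        const first = response.content[0];
        return first?.type === "text" ? first.text : "";
    }
}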
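
The content-hash invalidation mentioned for CAG can be sketched as follows. This is one possible approach under stated assumptions, not part of the class above: `SourceDoc`, `contentVersion`, and `VersionedContext` are hypothetical names, and the formatted context is rebuilt only when the hash of the document set changes.

```typescript
import { createHash } from 'crypto';

interface SourceDoc {
    title: string;
    content: string;
}

// Hash of the whole document set; any edit, addition, or removal changes it.
function contentVersion(documents: SourceDoc[]): string {
    const hash = createHash('sha256');
    // Sort by title so that input order alone does not change the version.
    const sorted = [...documents].sort((a, b) => a.title.localeCompare(b.title));
    for (const d of sorted) {
        hash.update(JSON.stringify([d.title, d.content]));
    }
    return hash.digest('hex');
}

// Holds the formatted context and rebuilds it only when the version changes.
class VersionedContext {
    private version: string | null = null;
    private context = "";

    // Returns true when the context was rebuilt.
    refresh(documents: SourceDoc[]): boolean {
        const next = contentVersion(documents);
        if (next === this.version) return false;
        this.context = documents.map(d => `## ${d.title}\n${d.content}`).join('\n\n');
        this.version = next;
        return true;
    }

    get text(): string {
        return this.context;
    }
}

const ctx = new VersionedContext();
console.log(ctx.refresh([{ title: "Guide", content: "v1" }])); // true: first build
console.log(ctx.refresh([{ title: "Guide", content: "v1" }])); // false: unchanged
console.log(ctx.refresh([{ title: "Guide", content: "v2" }])); // true: content changed
```

Because a prompt cache hit requires an identical prefix, sending the rebuilt text causes a fresh cache write on the provider side on the next request; the stale-data risk lies mainly in the application's own stored copy of the context.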

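Risks and Mitigations

Latency spikes on cache misses: when the hit rate is low, the cache lookup itself adds latency to every request. Consider non-blocking logic: start the LLM request in parallel with the cache lookup, and cancel the in-flight LLM request as soon as the lookup returns a hit. A request cancelled mid-flight may still incur some cost, so this trade is worthwhile mainly when latency matters more than spend.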
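
The parallel lookup can be sketched with an AbortController. This is a minimal illustration under stated assumptions: `raceCacheAndLLM` and both injected function signatures are hypothetical, a failed cache lookup is treated as a miss, and the stand-in "LLM" is a timer rather than a real client.

```typescript
// Start the LLM call and the cache lookup together; on a cache hit, abort the LLM call.
async function raceCacheAndLLM(
    lookup: () => Promise<string | null>,
    callLLM: (signal: AbortSignal) => Promise<string>,
): Promise<{ value: string; source: "cache" | "llm" }> {
    const controller = new AbortController();
    const llm = callLLM(controller.signal);  // already in flight while we check the cache
    // Avoid an unhandled rejection if the call is aborted after a cache hit.
    llm.catch(() => undefined);

    // A failing cache degrades to a miss instead of failing the request.
    const cached = await lookup().catch(() => null);
    if (cached !== null) {
        controller.abort();
        return { value: cached, source: "cache" };
    }
    return { value: await llm, source: "llm" };
}

// Stand-in for a model call: resolves after 50 ms unless aborted first.
const fakeLLM = (signal: AbortSignal) => new Promise<string>((resolve, reject) => {
    const timer = setTimeout(() => resolve("fresh answer"), 50);
    signal.addEventListener("abort", () => {
        clearTimeout(timer);
        reject(new Error("aborted"));
    });
});

raceCacheAndLLM(async () => "stored answer", fakeLLM)
    .then(result => console.log(result.source)); // prints: cache
```

With a real client, the same signal would be passed through whatever cancellation option that client exposes.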

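Stale cache content: when the source data changes (for example, the knowledge base is updated), old cache entries can lead to wrong answers. Implement invalidation based on a content hash or a timestamp, so that entries tied to the old content are cleared or bypassed automatically once the hash of the source content changes.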
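
One way to get this behavior for stored responses without an explicit purge is to scope each key to the version of the source data. The sketch below assumes that approach: `versionedKey` and `sha256Hex` are hypothetical helpers, and the key layout is illustrative only. Entries written under an old hash are simply never looked up again and are removed later by their TTL.

```typescript
import { createHash } from 'crypto';

function sha256Hex(text: string): string {
    return createHash('sha256').update(text).digest('hex');
}

// Build a response-cache key scoped to one version of the source data.
// A new knowledge-base hash yields new keys, so stale entries are bypassed.
function versionedKey(sourceHash: string, prompt: string): string {
    return `response:${sourceHash.slice(0, 16)}:${sha256Hex(prompt)}`;
}

// Hypothetical knowledge-base revisions.
const kbV1 = sha256Hex("knowledge base, revision 1");
const kbV2 = sha256Hex("knowledge base, revision 2");
const question = "How do refunds work?";

const keyV1 = versionedKey(kbV1, question);
const keyV2 = versionedKey(kbV2, question);
console.log(keyV1 === keyV2 ? "same key" : "different keys");
```

The trade-off is that every knowledge-base change starts with an empty cache, which is usually acceptable when correctness matters more than hit rate.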

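Download and Install

Download the prompt-caching Skill ZIP (Chinese edition)

After unzipping, place the directory in your AI tool's skills folder and restart the tool. See the bundled USAGE.zh.md for the exact path.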
