Redis

面向 3-5 年经验 Java 后端开发，覆盖数据结构/持久化/缓存问题/分布式锁/集群等高频考点。

每道题包含中英双语答案、代码示例、常见误区和风控关联。

相关页面：MySQL · 消息队列 · 分布式系统 · 特征平台

Q1. Redis 有哪些数据结构？底层编码分别是什么？

EN: What data types does Redis support? What are the underlying encodings for each?

难度： ★★★★ | 出现频率： 极高（阿里、美团、字节、蚂蚁）

Key Terms: String/字符串 (SDS), List/列表 (quicklist), Hash/哈希 (ziplist/hashtable), Set/集合 (intset/hashtable), ZSet/有序集合 (ziplist/skiplist+hashtable), Stream/流, HyperLogLog/基数估算, Bitmap/位图, GEO/地理位置

答案要点：

类型	底层编码	使用场景
String	SDS (Simple Dynamic String)，编码为 int / embstr / raw	缓存、计数器、分布式锁、Session
List	quicklist（由压缩节点组成的双向链表；节点在 7.0 之前是 ziplist，7.0 起是 listpack）	消息队列、最新列表、Timeline
Hash	ziplist（7.0 起为 listpack）/ hashtable	对象存储（用户信息、商品详情）
Set	intset / hashtable	去重、交集并集（共同好友）、抽奖
ZSet	ziplist（7.0 起为 listpack）/ skiplist+hashtable	排行榜、延迟队列、带权重的集合
Stream	radix tree (rax) + listpack	消息队列（类似 Kafka，支持消费组）

编码转换：元素少且单个元素小时用压缩编码（ziplist/listpack/intset），超过配置阈值（如 hash-max-listpack-entries）后自动转换为标准编码；转换是单向的，数据变小后不会自动转回压缩编码
SDS vs C string：O(1) 获取长度、二进制安全、空间预分配（减少扩容次数）、惰性空间释放

常见误区：

❌ 认为 Redis 数据类型就是底层编码 → ✅ 每种类型有多种底层编码，Redis 会根据数据大小自动转换（如 Hash 小数据量用 ziplist，大数据量用 hashtable）
❌ 以为 String 只能存字符串 → ✅ String 底层是 SDS，可以存储字符串、整数、甚至二进制数据（如图片、序列化对象）
❌ 把 List 当队列时没有考虑消费组需求 → ✅ 简单 FIFO 用 List 即可，需要消费组、消息确认等可靠性保障时应选 Stream
❌ Assuming Redis data types are the same as their underlying encodings → ✅ Each type has multiple internal encodings, and Redis automatically converts based on data size (e.g., Hash uses ziplist for small data, hashtable for larger data)
❌ Thinking String can only store text → ✅ String is backed by SDS and can store strings, integers, and binary data (e.g., images, serialized objects)
❌ Using List as a queue without considering consumer group requirements → ✅ Simple FIFO works with List, but for consumer groups and reliable messaging, use Stream instead

延伸追问：

Redis 7.0 之后 ziplist 被替换成了什么数据结构？为什么要替换？
SDS 相比 C 原生字符串有哪些具体优势？预分配策略是怎样的？
如果要用 Redis 存储一个用户画像对象，你会选 Hash 还是 JSON String？为什么？
What data structure replaced ziplist in Redis 7.0, and why was the replacement necessary?
How does SDS pre-allocation work, and what are its concrete advantages over C strings?
When storing a user profile object in Redis, would you choose Hash or JSON String? Why?

风控关联：

Hash 存储用户风险画像（field 对应特征维度），ZSet 做风险评分排行榜，Set 做黑名单去重
Hash is ideal for storing user risk profiles (fields map to feature dimensions), ZSet for risk score leaderboards, and Set for blacklist deduplication
关联特征平台

English Answer：

Redis provides several core data types, and the logical data type is not the same as its physical encoding. A String is implemented with SDS, or Simple Dynamic String. Compared with a C string, SDS supports O(1) length retrieval, binary safety, automatic expansion, and lazy space release, so it can store plain text, integers, serialized objects, and binary data.
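
The SDS properties above (O(1) length, binary safety, preallocation) can be sketched in a few lines. This is a toy Java model, not Redis's C struct; the class name `SdsSketch` is hypothetical. The growth rule follows the documented SDS preallocation policy: double the required size below 1 MB, otherwise add 1 MB.

```java
import java.util.Arrays;

/** Toy SDS-like buffer: explicit length plus spare capacity. Illustrative only. */
final class SdsSketch {
    private static final int MAX_PREALLOC = 1024 * 1024; // 1 MB cap on extra space

    private byte[] buf = new byte[0];
    private int len = 0; // used bytes; reading it is O(1), no scan for a '\0' terminator

    int length() { return len; }

    int capacity() { return buf.length; }

    /** Appends raw bytes; zero bytes are fine because length is tracked explicitly. */
    void append(byte[] data) {
        int needed = len + data.length;
        if (needed > buf.length) {
            // Preallocation: small strings get 2x the needed size, large ones get +1 MB.
            int newCap = needed < MAX_PREALLOC ? needed * 2 : needed + MAX_PREALLOC;
            buf = Arrays.copyOf(buf, newCap);
        }
        System.arraycopy(data, 0, buf, len, data.length);
        len = needed;
    }

    byte[] toBytes() { return Arrays.copyOf(buf, len); }
}
```

Because spare capacity is kept after growth, several small appends usually reuse the same allocation, which is the point of the preallocation strategy.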
A List is implemented as quicklist in modern Redis, which combines compact listpack-style storage with linked-list-like segmentation. It is suitable for simple queues, recent item lists, and timelines. A Hash uses compact encoding for small objects and hashtable encoding after it crosses size or element thresholds, so it is suitable for user profiles, product details, and feature maps.
A Set uses intset when all elements are integers and the collection is small; otherwise it uses hashtable. It is commonly used for deduplication, intersections, unions, lotteries, and blacklists. A ZSet, or sorted set, uses compact encoding for small data and skiplist plus hashtable for larger data. The skiplist supports range queries and ranking, while the hashtable supports fast member lookup.
Redis also provides Stream, HyperLogLog, Bitmap, and GEO. Stream is backed by radix-tree-related structures and is used for lightweight message queues with consumer groups. HyperLogLog is for approximate cardinality counting, Bitmap is for bit-level state tracking, and GEO is for location-based queries. Redis automatically changes encodings when data grows beyond thresholds, so design should consider both the external data type and internal memory/performance behavior.

Q2. 缓存击穿、穿透、雪崩分别是什么？怎么解决？

EN: What are cache breakdown, cache penetration, and cache avalanche? How do you prevent each?

难度： ★★★★★ | 出现频率： 极高（阿里、美团、字节，几乎所有公司必问）

Key Terms: cache penetration (缓存穿透), cache breakdown (缓存击穿), cache avalanche (缓存雪崩), bloom filter (布隆过滤器), mutex lock (互斥锁), random TTL (随机过期时间), null caching (空值缓存)

答案要点：

缓存穿透（Penetration）：查询不存在的数据，缓存和 DB 都没有，每次请求打到 DB

- 解决：① 布隆过滤器（Bloom Filter）拦截不存在的 key ② 缓存空值（设短 TTL）③ 接口层参数校验

缓存击穿（Breakdown）：热点 key 过期的瞬间，大量并发请求同时打到 DB

- 解决：① 互斥锁（SET key value NX EX）只允许一个线程重建缓存 ② 逻辑过期（不设 TTL，数据中存过期时间，过期后异步更新）③ 热点 key 永不过期

缓存雪崩（Avalanche）：大量 key 同时过期，或 Redis 宕机，请求全打到 DB

- 解决：① 过期时间加随机值（TTL = base + random(0, 300)）② 多级缓存（Redis + Caffeine）③ Redis 集群高可用 ④ 限流降级

常见误区：

❌ 把缓存穿透和缓存击穿搞混 → ✅ 穿透是查询根本不存在的数据（绕过缓存直击 DB），击穿是热点 key 过期瞬间的并发问题
❌ 认为布隆过滤器能完美解决穿透 → ✅ 布隆过滤器存在误判（false positive）：少量不存在的 key 会被误判为“可能存在”而放行到缓存/DB，但已添加的 key 不会被误拦截；标准实现不支持删除已添加的元素（可用 Counting Bloom Filter 改进）
❌ 缓存雪崩只考虑了 key 同时过期 → ✅ 雪崩还有另一个场景：Redis 节点宕机导致整个缓存层不可用
❌ Confusing cache penetration with cache breakdown → ✅ Penetration means querying data that doesn't exist at all (bypasses cache and hits DB), while breakdown is a concurrency issue when a hot key expires
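❌ Believing Bloom filters perfectly solve penetration → ✅ Bloom filters have a false positive rate: a small share of nonexistent keys is reported as "possibly present" and still reaches the cache/DB, although a key that was added is never rejected. The standard structure also does not support deletion (Counting Bloom Filter can help)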
❌ Only considering simultaneous key expiration for avalanche → ✅ Avalanche also occurs when a Redis node goes down, making the entire cache layer unavailable

延伸追问：

布隆过滤器的误判率怎么计算？如果数据量很大怎么解决内存问题？
逻辑过期方案中，过期后返回的是旧数据还是等待更新？这种 trade-off 在什么场景下可以接受？
如果用互斥锁防击穿，锁的超时时间怎么设置？获取锁失败的线程怎么处理？
How do you calculate the false positive rate of a Bloom filter? How to handle memory issues at large scale?
In the logical expiration approach, does the system return stale data or wait for a refresh? When is this trade-off acceptable?
How should you set the timeout for a mutex lock used to prevent cache breakdown? What happens to threads that fail to acquire the lock?

风控关联：

风控热点数据（如商户风险等级）的缓存策略直接影响决策延迟。建议热点 key 用逻辑过期 + 异步刷新
Caching strategy for hot risk control data (e.g., merchant risk levels) directly impacts decision latency. Use logical expiration with async refresh for hot keys
关联特征平台

English Answer：

Cache penetration means requests repeatedly query data that does not exist in either Redis or the database, so every request reaches the database. Typical solutions are a Bloom filter to reject definitely nonexistent keys before hitting the cache or database, caching null values with a short TTL, and parameter validation at the API layer to block obviously invalid requests.
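
A minimal in-process Bloom filter sketch may help make the "definitely nonexistent" guarantee concrete. The class name `BloomFilterSketch` and the hash mixing are assumptions chosen for brevity; in production one would more likely use an existing implementation such as Guava's `BloomFilter` or a Redis-side filter.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy Bloom filter: no false negatives, some false positives. Illustrative only. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets one bit per hash function for the key. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means the key was never added; true means "possibly added". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: position_i = h1 + i * h2, from two simple byte hashes.
    private int index(String key, int i) {
        int h1 = 0x811C9DC5; // FNV-1a offset basis
        int h2 = 7;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = (h1 ^ (b & 0xFF)) * 0x01000193; // FNV-1a prime
            h2 = h2 * 31 + (b & 0xFF);
        }
        return Math.floorMod(h1 + i * (h2 | 1), size);
    }
}
```

A request whose key returns `false` can be rejected before touching Redis or MySQL; a `true` result still needs the normal cache and database lookup.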
Cache breakdown means a hot key expires at the same moment that many concurrent requests arrive, causing all of them to rebuild the cache from the database. Common solutions are a mutex lock, for example SET key value NX EX, so only one thread rebuilds the cache; logical expiration, where the value stores an expiration timestamp and stale data may be returned while an async refresh runs; or making extremely hot keys never expire and refreshing them through background jobs.
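
The mutex approach can be sketched in-process as follows. This is a single-JVM illustration using a local `ReentrantLock`; across several service instances the lock would instead be a Redis `SET key value NX EX` lock. The class name `MutexRebuildCache` and the `loadFromDb` parameter are hypothetical, and the loader is assumed to return a non-null value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Only one thread per key rebuilds the cache entry; the rest wait and reuse it. */
final class MutexRebuildCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    String get(String key, Function<String, String> loadFromDb) {
        String value = cache.get(key);
        if (value != null) {
            return value; // fast path: cache hit
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            // Re-check after acquiring the lock: another thread may have rebuilt it.
            value = cache.get(key);
            if (value == null) {
                value = loadFromDb.apply(key);
                cache.put(key, value);
            }
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

The second lookup inside the lock is what keeps the database load at one query per expired hot key.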
Cache avalanche means many keys expire at the same time, or the Redis layer becomes unavailable, so traffic falls through to the database at scale. Solutions include adding random jitter to TTLs, using multi-level cache such as Caffeine plus Redis, deploying Redis with high availability, and adding rate limiting, circuit breaking, and degradation to protect the database.
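
TTL jitter is small enough to show directly. The helper name `ttlSeconds` is hypothetical; the base TTL and jitter range are whatever the workload tolerates.

```java
import java.util.Random;

/** Spreads expirations so keys written together do not expire together. */
final class TtlJitter {
    /** Returns a TTL in [baseSeconds, baseSeconds + maxJitterSeconds]. */
    static long ttlSeconds(long baseSeconds, int maxJitterSeconds, Random random) {
        return baseSeconds + random.nextInt(maxJitterSeconds + 1);
    }
}
```

For example, `TtlJitter.ttlSeconds(3600, 300, new Random())` yields a value between 3600 and 3900 seconds, matching the `TTL = base + random(0, 300)` rule.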
The key distinction is the failure pattern. Penetration is about nonexistent data, breakdown is about one hot key expiring under high concurrency, and avalanche is about many keys or the cache layer failing together. The mitigation must match the pattern; for example, a Bloom filter helps penetration but does not solve a Redis outage.

Q3. Redis 分布式锁怎么实现？Redisson 和手写有什么区别？

EN: How do you implement a distributed lock with Redis? Compare Redisson with a hand-rolled solution.

难度： ★★★★★ | 出现频率： 极高（阿里、美团、字节、蚂蚁）

Key Terms: SET NX EX (原子加锁), Redisson (分布式锁框架), watchdog (看门狗续期), RedLock (多节点锁), Lua script (Lua脚本释放锁), lock renewal (锁续期), fairness (公平锁)

答案要点：

手写基础版：SET lock_key unique_value NX EX 30，释放时用 Lua 脚本保证原子性
问题：锁过期但业务未完成 → 并发安全问题；不可重入；无等待队列
Redisson 解决方案：

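- 看门狗（Watchdog）：未显式指定 leaseTime 时，锁默认 30s 过期，每 10s（lockWatchdogTimeout 的 1/3）自动续期，直到业务释放锁或客户端宕机
- 可重入：Hash 结构记录重入次数（field 为客户端 ID + 线程 ID）
- 等待与通知：释放锁时通过 Redis Pub/Sub 通知等待线程；公平锁另外在 Redis 中维护等待队列，保证先到先得
- RedLock：多节点加锁（N/2+1 成功），解决单节点故障问题（但有争议，Martin Kleppmann 曾发文质疑）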

Redisson Lua 脚本加锁核心：

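```lua
-- KEYS[1] = lock name, ARGV[1] = lease time in ms, ARGV[2] = client id + thread id
-- Lock is free: create the hash with a reentry count of 1.
if (redis.call('exists', KEYS[1]) == 0) then
    redis.call('hset', KEYS[1], ARGV[2], 1)
    redis.call('pexpire', KEYS[1], ARGV[1])
    return nil
end
-- Lock is held by the same owner: increment the reentry count and refresh the TTL.
if (redis.call('hexists', KEYS[1], ARGV[2]) == 1) then
    redis.call('hincrby', KEYS[1], ARGV[2], 1)
    redis.call('pexpire', KEYS[1], ARGV[1])
    return nil
end
-- Lock is held by someone else: return the remaining TTL so the caller can wait.
return redis.call('pttl', KEYS[1])
```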

常见误区：

❌ 释放锁时直接 DEL → ✅ 必须用 Lua 脚本先判断 value 是否是自己持有的，再删除，否则可能误删别人的锁
❌ 认为加锁成功就万事大吉，不考虑锁超时 → ✅ 业务执行时间可能超过锁的过期时间，需要看门狗续期机制
❌ 盲目使用 RedLock 认为它能保证绝对安全 → ✅ RedLock 存在时钟漂移等问题，Martin Kleppmann 和 Antirez 之间有著名争论，多数场景下单节点 Redisson 足够
❌ Using DEL directly when releasing a lock → ✅ You must use a Lua script to check whether the value belongs to you before deleting, otherwise you may delete someone else's lock
❌ Assuming lock acquisition means everything is safe → ✅ Business execution may outlive the lock's expiration; a watchdog renewal mechanism is essential
❌ Blindly trusting RedLock for absolute safety → ✅ RedLock depends on timing assumptions such as bounded clock drift, which Martin Kleppmann and Antirez famously debated; for most use cases a single-node Redisson lock suffices

延伸追问：

如果 Redis 主节点加锁成功后宕机，从节点还未同步锁信息，会出现什么问题？怎么解决？
Redisson 的看门狗机制在客户端宕机时还能正常工作吗？
除了 Redis，还有哪些分布式锁实现方案？各自的优缺点是什么？
What happens if the Redis master crashes after acquiring a lock but before the replica syncs? How do you handle this?
Does Redisson's watchdog still work when the client itself crashes?
What distributed lock alternatives exist beyond Redis? What are their trade-offs?

风控关联：

风控规则热更新时用分布式锁防止并发修改。用户级操作用 userId 粒度锁防重复提交
Distributed locks prevent concurrent modification during risk control rule hot-reloads. User-level operations use userId-granularity locks to prevent duplicate submissions
关联特征平台

English Answer：

A basic Redis distributed lock is acquired with SET lock_key unique_value NX EX 30, which makes the lock acquisition atomic and gives the lock an expiration time. The value must be unique, usually a UUID plus thread identifier, so the client can prove ownership when releasing the lock. Releasing the lock must use a Lua script that checks whether the stored value belongs to the current client before deleting it; using DEL directly may delete another client's lock after the original lock has expired.
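
The ownership check can be illustrated with an in-memory stand-in. In this sketch `putIfAbsent` plays the role of `SET key token NX` and `remove(key, token)` plays the role of the compare-and-delete Lua script; expiry is deliberately omitted. The class name `TokenLockSketch` is hypothetical, and `RELEASE_SCRIPT` is the commonly used release script shown only as a string for reference.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory model of a token-based lock. Not a real Redis client. */
final class TokenLockSketch {
    /** Script a real client would send with EVAL to release the lock atomically. */
    static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then "
      + "return redis.call('del', KEYS[1]) else return 0 end";

    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** Returns the owner token on success, or null if the lock is already held. */
    String tryLock(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    /** Deletes the lock only if the stored token matches the caller's token. */
    boolean unlock(String key, String token) {
        return store.remove(key, token);
    }
}
```

A caller that lost the lock (for example after expiry) holds a stale token, so its `unlock` is a no-op instead of deleting the new owner's lock.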
The basic hand-written lock has several problems. If the business operation takes longer than the TTL, the lock may expire while the first thread is still executing, allowing another thread to enter the critical section. It is usually not reentrant, has no wait queue or fairness guarantees, and has no built-in renewal mechanism. These edge cases are often missed in simple implementations.
Redisson provides a more complete lock implementation. When no explicit lease time is passed, its watchdog sets a default lease of 30 seconds and renews it every 10 seconds as long as the client is alive and the lock has not been released. Redisson also supports reentrancy by storing a hash field for the owning thread and a reentry count. Waiting clients subscribe to a Redis Pub/Sub channel and are notified when the lock is released; the fair lock variant additionally keeps a waiting queue in Redis so that waiters acquire the lock in arrival order.
Redisson also offers RedLock, which tries to acquire locks on a majority of independent Redis nodes, but RedLock has known debates around clock drift and failure assumptions. In most business scenarios, a single Redis instance or Redis master with Redisson is enough if the lock is used for efficiency or duplicate prevention. For correctness-critical distributed coordination, ZooKeeper, etcd, or database constraints may be more appropriate.

Q4. Redis 持久化 RDB 和 AOF 有什么区别？怎么选？

EN: Compare RDB and AOF persistence. How do you choose between them?

难度： ★★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: RDB snapshot (RDB快照), AOF append-only file (AOF追加日志), fsync (刷盘策略), rewrite (重写压缩), mixed persistence (混合持久化), BGSAVE (后台保存)

答案要点：

维度	RDB	AOF
原理	定时全量快照（fork 子进程）	记录每条写命令
恢复速度	快（直接加载二进制）	慢（重放所有命令）
数据安全	可能丢失最后一次快照后的数据	最多丢 1s（everysec）
文件体积	小（压缩二进制）	大（文本命令，需 rewrite 压缩）
对性能影响	fork 时 COW 有内存开销	everysec 模式影响较小

混合持久化（Redis 4.0+）：AOF rewrite 时前半段用 RDB 格式，后半段用 AOF 增量。兼顾恢复速度和数据安全
推荐配置：生产环境开启混合持久化 aof-use-rdb-preamble yes

常见误区：

❌ 认为 AOF 模式下数据一定不丢失 → ✅ AOF 的 fsync 策略决定了数据安全性：always 每条命令都刷盘（性能差）、everysec 每秒刷盘（最多丢 1s）、no 由 OS 决定（可能丢更多）
❌ 认为 RDB 的 BGSAVE 不会阻塞主线程 → ✅ fork 子进程时如果数据量大，主线程会短暂阻塞（copy-on-write 期间还有内存翻倍风险）
❌ 选型时只考虑数据安全不考虑恢复时间 → ✅ 大规模数据集下 AOF 重放恢复可能需要数十分钟，RDB 加载则快得多
❌ Assuming AOF never loses data → ✅ AOF's fsync policy determines data safety: always flushes every command (slow), everysec flushes per second (max 1s loss), no leaves it to the OS (potential for more loss)
❌ Believing BGSAVE never blocks the main thread → ✅ Forking a child process can briefly block the main thread on large datasets, and copy-on-write may double memory usage
❌ Choosing based only on data safety without considering recovery time → ✅ AOF replay on large datasets can take tens of minutes, while RDB loading is much faster

延伸追问：

AOF rewrite 的触发条件是什么？rewrite 过程中如果有新写命令怎么处理？
Redis fork 子进程做 BGSAVE 时，如果内存使用量是 10GB，大概需要多少额外内存？
如果 RDB 文件损坏了，有没有办法恢复部分数据？
What triggers an AOF rewrite? How are new write commands handled during the rewrite process?
When Redis forks a child process for BGSAVE with 10GB memory usage, how much additional memory is roughly needed?
If an RDB file is corrupted, is there any way to recover partial data?

风控关联：

风控系统的 Redis 实例通常包含实时特征数据，建议开启混合持久化，确保故障恢复后特征数据不丢失。对于纯缓存场景可以只用 RDB 或关闭持久化以提升性能
Redis instances in risk control systems typically hold real-time feature data. Enable mixed persistence to prevent feature data loss after failover. For pure caching scenarios, RDB-only or disabling persistence can improve performance
关联特征平台

English Answer：

RDB persistence saves periodic full snapshots. Redis forks a child process to write a compact binary snapshot, so restart recovery is fast because Redis can load the snapshot directly. The trade-off is data loss: if Redis crashes, data written after the last snapshot may be lost. Forking can also briefly block the main thread, and copy-on-write may increase memory usage when the dataset is large and writes continue during snapshotting.
AOF persistence records write commands in an append-only file. Its data safety depends on the fsync policy. always flushes every command and is safer but slow; everysec flushes once per second and usually loses at most one second of data; no leaves flushing to the operating system and may lose more. AOF files are larger and recovery can be slower because Redis must replay commands. AOF rewrite is needed to compact redundant commands.
Redis 4.0 introduced mixed persistence. During AOF rewrite, the beginning of the new AOF file can be written in RDB format, followed by incremental AOF commands. This combines fast RDB-style recovery with better AOF-style data safety, so aof-use-rdb-preamble yes is a common production choice.
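
The recovery idea behind mixed persistence (load a snapshot, then replay only the commands logged after it) can be modeled in memory. This is a conceptual sketch with hypothetical names (`MixedPersistenceSketch`, `rewrite`, `recover`); it does not reflect the real AOF or RDB file formats.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Snapshot preamble plus incremental command tail, modeled with collections. */
final class MixedPersistenceSketch {
    final Map<String, String> data = new HashMap<>();
    private Map<String, String> snapshot = new HashMap<>();   // "RDB preamble"
    private final List<String[]> tail = new ArrayList<>();    // "AOF increment"

    /** Applies a write and appends it to the command tail. */
    void set(String key, String value) {
        data.put(key, value);
        tail.add(new String[] {key, value});
    }

    /** Rewrite: fold the current state into a snapshot and drop the old tail. */
    void rewrite() {
        snapshot = new HashMap<>(data);
        tail.clear();
    }

    /** Recovery: load the snapshot, then replay only the commands written after it. */
    Map<String, String> recover() {
        Map<String, String> restored = new HashMap<>(snapshot);
        for (String[] cmd : tail) {
            restored.put(cmd[0], cmd[1]);
        }
        return restored;
    }

    int tailLength() { return tail.size(); }
}
```

After a rewrite the tail is short, so recovery replays few commands while still including writes made since the snapshot.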
The choice depends on the role of Redis. For pure cache, persistence can be disabled or RDB-only may be acceptable. For important feature data, sessions, or risk-control state, I would enable AOF with everysec plus mixed persistence, monitor rewrite latency and disk usage, and still rely on replication and business recovery mechanisms rather than treating Redis persistence as the only durability layer.

Q5. Redis 集群方案有哪些？各有什么优缺点？

EN: What Redis clustering strategies exist? Compare their pros and cons.

难度： ★★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: master-slave replication (主从复制), Sentinel (哨兵模式), Redis Cluster (集群模式), hash slot (哈希槽), Gossip protocol (Gossip 协议), resharding (重新分片)

答案要点：

主从复制：一主多从，读写分离。缺点：主节点单点故障，手动故障转移
Sentinel（哨兵）：在主从基础上增加 Sentinel 节点监控，自动故障转移。缺点：单主写入瓶颈，不支持在线扩容
Redis Cluster：数据分片（16384 个 hash slot 分布在多主），每个主节点有从节点。支持在线扩缩容、自动故障转移。缺点：多 key 命令、事务和 Lua 脚本要求所有 key 在同一 slot（不支持跨 slot）、运维复杂

常见误区：

❌ 认为 Sentinel 解决了所有高可用问题 → ✅ Sentinel 只解决了主节点故障的自动转移，但写入仍然是单点瓶颈，不支持数据分片
❌ Redis Cluster 中随意使用多 key 操作 → ✅ Cluster 要求多 key 操作的 key 必须在同一个 hash slot，需要使用 hash tag（如 {user}:info 和 {user}:score）确保同 slot
❌ 认为 hash slot 数量可以自定义 → ✅ Redis Cluster 固定 16384 个 slot，不可修改
❌ Believing Sentinel solves all high-availability issues → ✅ Sentinel only provides automatic failover for the master node — writes remain a single-point bottleneck and data sharding is not supported
❌ Using multi-key operations freely in Redis Cluster → ✅ Cluster requires all keys in a multi-key operation to reside in the same hash slot; use hash tags (e.g., {user}:info and {user}:score) to ensure co-location
❌ Assuming the number of hash slots is configurable → ✅ Redis Cluster has a fixed 16384 slots — this cannot be changed

延伸追问：

Redis Cluster 的 Gossip 协议是怎么工作的？节点之间通信延迟会不会影响故障检测？
如果需要在线扩容添加新节点，hash slot 迁移过程中对线上请求有什么影响？
Sentinel 的主观下线和客观下线分别是什么？为什么要区分？
How does the Gossip protocol work in Redis Cluster? Can inter-node communication delays affect failure detection?
What is the impact on live requests during hash slot migration when adding a new node for online scaling?
What are subjective down and objective down in Sentinel? Why does Sentinel distinguish between them?

风控关联：

风控特征缓存建议 Redis Cluster，高可用 + 水平扩展。注意 key 设计使用 hash tag（{userId}）确保同一用户的数据在同一 slot
Redis Cluster is recommended for risk control feature caching, providing high availability and horizontal scaling. Use hash tags ({userId}) in key design to ensure data for the same user resides in the same slot
关联特征平台

English Answer：

The simplest deployment is master-replica replication. One master handles writes, and one or more replicas can serve reads or act as failover candidates. It improves read scalability and data redundancy, but the master is still a single write bottleneck and failover may require manual or external coordination.
Sentinel adds monitoring and automatic failover on top of master-replica replication. Sentinel nodes monitor the master and replicas, distinguish subjective down from objective down, elect a new master when needed, and notify clients. This improves high availability, but it still has only one writable master for a shard and does not provide horizontal data sharding or online capacity expansion by itself.
Redis Cluster is the native sharding solution. It divides data into 16,384 hash slots distributed across multiple masters, and each master can have replicas. It supports automatic failover and online resharding, so it is the usual choice for large-scale deployments. The trade-off is operational complexity and restrictions on multi-key commands, transactions, and Lua scripts: all involved keys must be in the same slot, often enforced with hash tags such as {userId}:profile and {userId}:score.
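
The slot mapping itself is short: CRC16 (XMODEM variant) of the key, modulo 16384, with the hash-tag rule applied first. The sketch below follows the rule described in the Redis Cluster specification; the class name `ClusterSlotSketch` is hypothetical, and real clients ship their own implementation.

```java
import java.nio.charset.StandardCharsets;

/** Computes the Redis Cluster hash slot for a key, including hash-tag handling. */
final class ClusterSlotSketch {
    /** CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0. */
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** If the key contains a non-empty {tag}, only the tag is hashed. */
    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
```

Keys such as `{userId}:profile` and `{userId}:score` hash only the part inside the braces, so they land in the same slot and can be used together in multi-key commands.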
In practice, I would choose standalone or master-replica for small or non-critical deployments, Sentinel for high availability without sharding, and Redis Cluster when write throughput, memory size, or horizontal scaling requirements exceed a single master. For risk-control feature caches, Redis Cluster with careful key design is usually the right direction.

Q6. Redis 和 MySQL 双写一致性怎么保证？

EN: How do you maintain consistency between Redis cache and MySQL?

难度： ★★★★★ | 出现频率： 极高（阿里、美团、字节）

Key Terms: Cache Aside (旁路缓存), write-through (写穿透), write-behind (异步写回), delayed double delete (延迟双删), Canal (Canal 中间件), binlog (二进制日志)

答案要点：

Cache Aside Pattern（推荐）：

- 读：先读缓存 → 未命中读 DB → 写入缓存
- 写：先更新 DB → 再删除缓存（不是更新缓存）

为什么是删除而不是更新缓存：并发写时更新缓存可能导致旧值覆盖新值
延迟双删：更新 DB → 删缓存 → 延迟 500ms → 再删缓存。解决读写并发时缓存被旧数据回填
最终一致性方案：Canal 监听 MySQL binlog → 异步删除/更新 Redis。不侵入业务代码
强一致性：分布式锁（性能差，一般不用）

常见误区：

❌ 写操作时先删缓存再更新 DB → ✅ 这会导致并发读时缓存被旧数据回填（先删缓存 → 另一个线程读到 DB 旧数据写入缓存 → DB 更新完成但缓存是旧的）。应先更新 DB 再删缓存
❌ 认为延迟双删能保证强一致性 → ✅ 延迟双删仍是最终一致性方案，延迟时间难以精确设定，只能尽量缩小不一致窗口
❌ 缓存删除失败后没有重试机制 → ✅ 删除缓存失败会导致数据长期不一致，需要引入消息队列或 Canal 做可靠删除
❌ Deleting cache before updating the DB during a write → ✅ This causes stale data to be repopulated into the cache by concurrent reads (delete cache → another thread reads stale DB data and writes it to cache → DB updated but cache is stale). Always update DB first, then delete cache
❌ Believing delayed double-delete guarantees strong consistency → ✅ Delayed double-delete is still an eventual consistency approach; the delay is hard to calibrate precisely and only narrows the inconsistency window
❌ Having no retry mechanism when cache deletion fails → ✅ A failed cache deletion causes long-term data inconsistency; introduce a message queue or Canal for reliable deletion

延伸追问：

先更新 DB 再删缓存就一定没有问题吗？极端并发场景下还有什么边界情况？
Canal 监听 binlog 的延迟一般是多少？如果对实时性要求很高怎么办？
如果是读多写少的场景，你会选 Cache Aside 还是 Write-Through？为什么？
Is "update DB first, then delete cache" completely foolproof? What edge cases exist under extreme concurrency?
What is the typical latency of Canal listening to binlog? What if very high real-time consistency is required?
In a read-heavy, write-light scenario, would you choose Cache Aside or Write-Through? Why?

风控关联：

风控规则配置变更后的缓存失效可以用 Canal 监听 binlog 方案，确保规则实时生效
Cache invalidation after risk control rule configuration changes can use Canal to listen to binlog, ensuring rules take effect in real time
关联特征平台

English Answer：

The most common pattern is Cache Aside. On reads, the service reads Redis first. If Redis misses, it reads MySQL and then writes the result back to Redis with a TTL. On writes, the recommended order is to update MySQL first and then delete the cache, rather than updating the cache directly. Updating cache values during concurrent writes can let an old value overwrite a new one.
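
The read and write paths of Cache Aside can be sketched with two maps standing in for MySQL and Redis. The class name `CacheAsideSketch` is hypothetical, and TTLs and failure handling are omitted to keep the ordering visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Cache Aside with in-memory stand-ins for the database and the cache. */
final class CacheAsideSketch {
    final Map<String, String> db = new ConcurrentHashMap<>();
    final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Read: cache first, then database on a miss, then backfill the cache. */
    String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = db.get(key);
        if (fromDb != null) {
            cache.put(key, fromDb);
        }
        return fromDb;
    }

    /** Write: update the database first, then delete (not update) the cache entry. */
    void write(String key, String value) {
        db.put(key, value);
        cache.remove(key);
    }
}
```

The next read after a write misses the cache and reloads the fresh value from the database.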
Deleting the cache before updating the database is risky. A concurrent read may miss the cache, read old data from MySQL, and repopulate Redis with the old value before the database update commits. After the database update completes, Redis still contains stale data. Updating the DB first and then deleting the cache reduces this window, although it still provides eventual consistency rather than strict consistency.
Delayed double delete can further narrow the race window: update the DB, delete the cache, wait for a short delay such as 500ms, and delete the cache again. This handles cases where a concurrent read repopulates stale data during the write. The delay is hard to tune, so it is a mitigation, not a strong consistency guarantee.
For reliable eventual consistency, Canal or another binlog-based component can listen to MySQL binlog and asynchronously invalidate or update Redis. This reduces intrusion into business code and allows retry if cache deletion fails. If strict consistency is required, distributed locks or transactional designs can be used, but they hurt performance and are rarely used for normal cache scenarios.

Q7. Redis BigKey 和 HotKey 问题怎么发现和处理？

EN: How do you detect and handle BigKey and HotKey issues in Redis?

难度： ★★★★ | 出现频率： 中高（美团、字节、蚂蚁）

Key Terms: BigKey (大 Key), HotKey (热 Key), redis-cli --bigkeys (大 Key 扫描), SCAN (遍历扫描), UNLINK (异步删除), local cache (本地缓存), read-only replica (只读从节点)

答案要点：

BigKey 定义：String > 10KB，集合元素 > 5000
危害：阻塞 Redis（DEL 大 key 耗时长）、网络带宽压力、内存不均衡
发现：redis-cli --bigkeys、MEMORY USAGE key、SCAN 遍历
处理：① 拆分（如 Hash 拆分为多个小 Hash）② UNLINK 异步删除 ③ 定期清理
HotKey 发现：redis-cli --hotkeys（需开启 LFU）、监控 QPS 异常、客户端采样
HotKey 处理：① 本地缓存（Caffeine）② 读写分离（增加从节点）③ 热点 key 分散（加随机后缀）

常见误区：

❌ 发现 BigKey 后直接 DEL 删除 → ✅ 大 key 的 DEL 会在主线程同步释放内存，阻塞 Redis；应使用 UNLINK（Redis 4.0+）在后台线程异步回收。Redis 6.0+ 可开启 lazyfree-lazy-user-del，让 DEL 的行为等同 UNLINK
❌ 只关注 BigKey 不关注 HotKey → ✅ 即使 key 很小，如果 QPS 极高也会造成单节点瓶颈，HotKey 和 BigKey 是两个不同维度的风险
❌ 认为热点 key 分散（加随机后缀）能彻底解决问题 → ✅ 分散后需要在客户端维护多个 key 的读取逻辑，增加了复杂度，且写入时需要同时写多个副本
❌ Directly using DEL to remove a discovered BigKey → ✅ DEL on a large key frees memory synchronously and blocks the Redis main thread; use UNLINK (Redis 4.0+) to reclaim memory in a background thread. On Redis 6.0+, enabling lazyfree-lazy-user-del makes DEL behave like UNLINK
❌ Focusing only on BigKey and ignoring HotKey → ✅ Even a small key with extremely high QPS can create a single-node bottleneck; HotKey and BigKey are risks in different dimensions
❌ Believing hot-key sharding (random suffixes) completely solves the problem → ✅ Sharding requires the client to maintain read logic for multiple keys, increasing complexity, and writes must update all replicas simultaneously

延伸追问：

redis-cli --bigkeys 的底层实现原理是什么？它会阻塞 Redis 吗？
如果 BigKey 是一个包含百万元素的 ZSet，怎么在不影响线上服务的情况下清理？
如何在不修改业务代码的前提下自动发现 HotKey？
What is the underlying implementation of redis-cli --bigkeys? Does it block Redis?
How do you clean up a BigKey that is a ZSet with millions of elements without affecting production traffic?
How can you automatically detect HotKeys without modifying business code?

风控关联：

热点商户的风控画像可能是 HotKey，建议本地缓存（Caffeine）+ Redis 双层架构
Risk profiles for hot merchants can become HotKeys; a two-tier architecture with local cache (Caffeine) + Redis is recommended
关联特征平台

English Answer：

A BigKey is usually defined as a String larger than about 10KB or a collection containing more than several thousand elements, such as more than 5,000 members. The exact threshold depends on the business and Redis capacity. BigKeys are dangerous because reading or writing them creates network pressure, deleting them with DEL can block the Redis main thread, and uneven large keys can cause memory imbalance across cluster nodes.
BigKeys can be detected with redis-cli --bigkeys, MEMORY USAGE key, and careful SCAN-based sampling. They should be handled by splitting data into smaller keys, for example sharding a large Hash into multiple Hashes, and by using UNLINK instead of DEL for asynchronous deletion. Cleanup should be gradual and preferably done during low-traffic windows.
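
Splitting a large Hash usually means routing each field to one of N smaller keys. A minimal sketch, assuming a hypothetical helper `subKey` and a fixed bucket count chosen by the application:

```java
/** Routes a field of a large Hash to one of several smaller sub-hashes. */
final class HashSplitter {
    /** Returns baseKey + ":" + bucket, where bucket is derived from the field name. */
    static String subKey(String baseKey, String field, int buckets) {
        int bucket = Math.floorMod(field.hashCode(), buckets);
        return baseKey + ":" + bucket;
    }
}
```

A read or write for a field then targets `HSET subKey(...) field value` instead of the original key. The bucket count should stay fixed once data is written, because changing it remaps fields to different sub-keys.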
A HotKey is a key with extremely high access frequency. Even if it is small, it can overload one Redis node or one cluster slot. HotKeys can be detected with redis-cli --hotkeys when LFU is enabled, QPS monitoring, proxy-side statistics, or client-side sampling.
HotKeys are usually handled with local cache such as Caffeine, read replicas, key sharding with random suffixes, or prewarming. Sharding a hot key adds complexity because clients must read from or write to multiple copies consistently. For risk-control profiles of popular merchants or high-traffic entities, local cache plus Redis is often the most practical solution.
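
Key sharding with suffixes can be sketched as two helpers: reads pick one copy at random, writes must cover every copy. The names `readKey` and `writeKeys` are hypothetical, and numeric suffixes are one possible scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Spreads one hot key over several copies so reads hit different keys. */
final class HotKeyCopies {
    /** Picks one of the copies at random for a read. */
    static String readKey(String key, int copies) {
        return key + ":" + ThreadLocalRandom.current().nextInt(copies);
    }

    /** Lists every copy; a write has to update all of them. */
    static List<String> writeKeys(String key, int copies) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            keys.add(key + ":" + i);
        }
        return keys;
    }
}
```

This makes the cost noted above visible: every write fans out to all copies, and readers may briefly see different versions while a write is in progress.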

Q8. Redis Lua 脚本有什么用？注意事项？

EN: Why use Lua scripts in Redis? What are the caveats?

难度： ★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: EVAL (执行脚本), atomicity (原子性), KEYS/ARGV (键/参数), script cache (脚本缓存), SHA1 (哈希校验)

答案要点：

核心作用：将多个 Redis 命令原子执行（Redis 单线程执行 Lua 脚本时不会被其他命令打断）
使用场景：分布式锁释放、限流（滑动窗口）、库存扣减
注意事项：

- Lua 脚本不能有死循环或长时间运行（lua-time-limit 默认 5s）
- key 必须通过 KEYS 参数传入（Redis Cluster 路由需要）
- 脚本会被缓存（SHA1），用 EVALSHA 减少网络开销

常见误区：

❌ 认为 Lua 脚本能保证多个 key 的强一致性 → ✅ 在 Redis Cluster 中，如果涉及多个 key 且不在同一个 slot，Lua 脚本无法执行，需要用 hash tag 确保同 slot
❌ 在 Lua 脚本中直接硬编码 key 名 → ✅ key 必须通过 KEYS 参数传入，否则 Redis Cluster 无法正确路由请求
❌ Lua 脚本中执行耗时操作没关系，反正不会被其他命令打断 → ✅ Lua 脚本执行期间会阻塞 Redis 主线程，超过 lua-time-limit 后其他请求会收到 BUSY 错误
❌ Assuming Lua scripts guarantee strong consistency across multiple keys → ✅ In Redis Cluster, if multiple keys reside in different slots, the Lua script cannot execute; use hash tags to ensure co-location
❌ Hardcoding key names directly in Lua scripts → ✅ Keys must be passed via the KEYS parameter, otherwise Redis Cluster cannot route the request correctly
❌ Thinking long-running operations in Lua are harmless since no other commands can interrupt → ✅ Lua scripts block the Redis main thread; exceeding lua-time-limit causes other requests to receive BUSY errors

延伸追问：

EVAL 和 EVALSHA 有什么区别？如果服务端没有缓存对应的 SHA1 会怎样？
Lua 脚本执行过程中如果 Redis 宕机，脚本执行到一半的命令会回滚吗？
如何对 Lua 脚本做版本管理和灰度发布？
What is the difference between EVAL and EVALSHA? What happens if the server hasn't cached the corresponding SHA1?
If Redis crashes during Lua script execution, will the partially executed commands be rolled back?
How do you manage versions and perform canary releases for Lua scripts?

风控关联：

风控限流场景（如滑动窗口限流）适合用 Lua 脚本实现原子操作，避免并发下限流不精确
Rate limiting in risk control (e.g., sliding window rate limiting) is well-suited for Lua scripts to ensure atomic operations and prevent inaccurate limits under concurrency
关联特征平台

English Answer：

Redis Lua scripts are used when multiple Redis commands must execute atomically. Because Redis executes a Lua script as a single unit on the main execution thread, no other command interleaves with the script. Common use cases include releasing a distributed lock safely, implementing sliding-window rate limiting, and deducting inventory or quota with read-check-write logic.
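
The read-check-write logic of a sliding-window limiter is the kind of sequence a Lua script makes atomic. The sketch below models that logic in memory with a `synchronized` method; on Redis the same three steps (drop expired entries, count, add) would typically run inside one script over a sorted set. The class name `SlidingWindowLimiter` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory sliding-window limiter; the three steps run as one atomic unit. */
final class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> hits = new ArrayDeque<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        // 1. Drop timestamps that have left the window.
        while (!hits.isEmpty() && hits.peekFirst() <= nowMillis - windowMillis) {
            hits.pollFirst();
        }
        // 2. Reject if the window is already full.
        if (hits.size() >= limit) {
            return false;
        }
        // 3. Record this request.
        hits.addLast(nowMillis);
        return true;
    }
}
```

If the three steps were separate Redis commands, two clients could both pass the count check before either records its request, which is the inaccuracy the script avoids.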
Lua scripts are not transactions with rollback. If Redis crashes during execution, already applied writes are not automatically rolled back like a database transaction. The script should therefore be short, deterministic, and idempotent where possible. Long loops or heavy computation are dangerous because scripts block Redis while running, and after lua-time-limit, commonly 5 seconds, Redis may start returning BUSY errors to other clients.
In Redis Cluster, keys must be passed through the KEYS array instead of being hardcoded in the script, because the cluster router needs to know the target slot. If a script operates on multiple keys, those keys must be in the same hash slot, usually by using a hash tag. Otherwise the script cannot run in cluster mode.
For performance, Redis caches scripts by SHA1. The client can send the full script once with EVAL, then invoke it with EVALSHA. If the server does not have the script cached, the client should handle the NOSCRIPT response and fall back to EVAL. In production, Lua scripts should be versioned, tested, and rolled out carefully just like application code.
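
The EVALSHA-then-EVAL fallback can be sketched against a minimal client abstraction. `ScriptClient` and `NoScriptException` are hypothetical stand-ins defined here for illustration; real clients such as Jedis or Lettuce expose different types and signal NOSCRIPT in their own way.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical minimal client interface, not a real library API. */
interface ScriptClient {
    Object evalsha(String sha1, List<String> keys, List<String> args);
    Object eval(String script, List<String> keys, List<String> args);
}

/** Hypothetical exception standing in for a NOSCRIPT error reply. */
final class NoScriptException extends RuntimeException {}

final class ScriptRunner {
    /** Hex-encoded SHA-1 of the script body, the identifier EVALSHA expects. */
    static String sha1Hex(String script) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(script.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Tries the cached script first and falls back to sending the full body. */
    static Object run(ScriptClient client, String script, List<String> keys, List<String> args) {
        try {
            return client.evalsha(sha1Hex(script), keys, args);
        } catch (NoScriptException e) {
            return client.eval(script, keys, args); // EVAL also caches the script server-side
        }
    }
}
```

The fallback matters after a server restart or failover, when the script cache on the new node may be empty.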

Q9. Redis Stream 和 List 做消息队列有什么区别？

EN: Compare Redis Stream and List as message queues.

难度： ★★★ | 出现频率： 中高（美团、字节）

Key Terms: Stream (流), XADD/XREAD/XGROUP (流操作命令), Consumer Group (消费组), pending entries list/PEL (待处理列表), ACK (消息确认), last delivered ID (最后投递 ID)

答案要点：

List：简单 FIFO，LPUSH + BRPOP。缺点：无消费组、无消息确认、无历史回溯
Stream（Redis 5.0+）：类似 Kafka，支持消费组（XGROUP）、消息确认（XACK）、未处理消息重分配（XPENDING/XCLAIM）、历史消息回溯
选型：简单场景用 List；需要消费组/消息可靠性用 Stream；重量级用 Kafka/RocketMQ

常见误区：

❌ 认为 Redis Stream 可以完全替代 Kafka → ✅ Stream 是轻量级消息队列，缺少 Kafka 的分区水平扩展、消息回压控制、多副本同步等能力，不适合大规模消息场景
❌ 用 List 做消息队列时不处理消费者宕机导致的消息丢失 → ✅ List 模式下消息一旦 BRPOP 就从队列移除，消费者处理失败则消息丢失，没有重试机制
❌ Stream 的 PEL（pending entries list）不需要维护 → ✅ 长时间未 ACK 的消息会堆积在 PEL 中，需要定期用 XPENDING + XCLAIM 做消息重分配
❌ Believing Redis Stream can fully replace Kafka → ✅ Stream is a lightweight message queue lacking Kafka's partition-level horizontal scaling, backpressure control, and multi-replica sync — unsuitable for large-scale messaging
❌ Not handling message loss from consumer crashes when using List as a queue → ✅ In List mode, once BRPOP removes a message, it's gone — if the consumer fails, the message is lost with no retry mechanism
❌ Neglecting maintenance of Stream's PEL (pending entries list) → ✅ Unacknowledged messages pile up in the PEL; periodic XPENDING + XCLAIM is needed to redistribute them

延伸追问：

Stream 的 MAXLEN 和 MINID trim 策略有什么区别？分别适合什么场景？
Stream 消费组中如果有消费者长时间不 ACK，怎么处理这些 pending 消息？
Stream 的消息 ID 是全局唯一的吗？它的组成结构是什么？
What is the difference between Stream's MAXLEN and MINID trim strategies? Which scenarios suit each?
How do you handle pending messages when a consumer in a Stream consumer group hasn't ACKed for a long time?
Are Stream message IDs globally unique? What is their internal structure?

风控关联：

风控事件流（如交易事件）的轻量级分发可用 Stream，支持消费组实现多策略引擎并行消费
Lightweight distribution of risk control event streams (e.g., transaction events) can use Stream, with consumer groups enabling parallel consumption by multiple strategy engines
关联特征平台

English Answer：

A Redis List can be used as a simple FIFO queue with commands such as LPUSH and BRPOP. It is easy to understand and suitable for lightweight, low-reliability scenarios. The main limitation is that once BRPOP removes a message, Redis no longer tracks it. If the consumer crashes after popping the message but before processing finishes, the message is lost. List also has no native consumer groups, acknowledgments, pending list, or historical replay.
Redis Stream, introduced in Redis 5.0, is closer to a lightweight Kafka-style queue. Producers append messages with XADD. Consumers read either directly with XREAD or as members of a consumer group, which is created with XGROUP CREATE and read with XREADGROUP, and they acknowledge processed messages with XACK. Stream maintains a pending entries list, or PEL, so unacknowledged messages can be inspected with XPENDING and reassigned with XCLAIM or related commands.
Stream also supports historical replay by message ID and trimming through strategies such as MAXLEN and MINID. However, it is still not a full replacement for Kafka or RocketMQ. It lacks Kafka's mature partition-level scaling, large-scale retention model, backpressure ecosystem, and cross-cluster messaging capabilities.
My selection rule is simple: use List for very simple FIFO tasks where occasional loss is acceptable or handled elsewhere; use Stream when consumer groups, ACK, retry, and replay are needed; use Kafka, RocketMQ, or another dedicated MQ for large-scale event streams, long retention, and stronger operational guarantees.
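
To make the consumer group and PEL mechanics concrete, here is a minimal in-memory model in plain Java. It is a sketch under simplifying assumptions, not a Redis client: IDs are plain `long` values, there is one stream and one group, and `claimIdle` reassigns every idle entry at once, which is closer to XAUTOCLAIM (Redis 6.2+) than to XCLAIM with explicit IDs. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of one stream with one consumer group.
class StreamGroupModel {
    static final class PendingEntry {
        String consumer;
        long lastDeliveredAtMs;
        int deliveryCount = 1;

        PendingEntry(String consumer, long nowMs) {
            this.consumer = consumer;
            this.lastDeliveredAtMs = nowMs;
        }
    }

    private final TreeMap<Long, String> entries = new TreeMap<>();   // id -> payload
    private final TreeMap<Long, PendingEntry> pel = new TreeMap<>(); // delivered, not yet ACKed
    private long nextId = 1;
    private long lastDeliveredId = 0;

    // Stands in for XADD: append an entry and return its id.
    long xadd(String payload) {
        entries.put(nextId, payload);
        return nextId++;
    }

    // Stands in for XREADGROUP with ">": deliver entries never delivered to the
    // group before, and record each one in the PEL under the reading consumer.
    List<Long> xreadgroup(String consumer, int count, long nowMs) {
        List<Long> delivered = new ArrayList<>();
        for (Long id : entries.tailMap(lastDeliveredId, false).keySet()) {
            if (delivered.size() == count) break;
            pel.put(id, new PendingEntry(consumer, nowMs));
            delivered.add(id);
        }
        if (!delivered.isEmpty()) {
            lastDeliveredId = delivered.get(delivered.size() - 1);
        }
        return delivered;
    }

    // Stands in for XACK: the entry leaves the PEL.
    boolean xack(long id) {
        return pel.remove(id) != null;
    }

    // Stands in for the XPENDING summary count.
    int pendingCount() {
        return pel.size();
    }

    // Returns the consumer that currently owns a pending entry, or null.
    String ownerOf(long id) {
        PendingEntry p = pel.get(id);
        return p == null ? null : p.consumer;
    }

    // Reassign entries idle for at least minIdleMs to another consumer.
    List<Long> claimIdle(String newConsumer, long minIdleMs, long nowMs) {
        List<Long> claimed = new ArrayList<>();
        for (Map.Entry<Long, PendingEntry> e : pel.entrySet()) {
            PendingEntry p = e.getValue();
            if (nowMs - p.lastDeliveredAtMs >= minIdleMs) {
                p.consumer = newConsumer;
                p.lastDeliveredAtMs = nowMs;
                p.deliveryCount++;
                claimed.add(e.getKey());
            }
        }
        return claimed;
    }
}
```

The `deliveryCount` field is what a real consumer would check to move a repeatedly failing message to a dead-letter stream instead of retrying forever.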

Q10. Redis 内存淘汰策略有哪些？怎么选？

EN: What are Redis eviction policies? How do you choose?

难度： ★★★ | 出现频率： 中高（阿里、美团）

Key Terms: maxmemory (最大内存), noeviction (不淘汰), allkeys-lru (全量 LRU), volatile-lru (带过期 LRU), allkeys-lfu (全量 LFU), volatile-lfu (带过期 LFU), random (随机淘汰)

答案要点：

不淘汰：noeviction（默认，内存满了拒绝写入）
LRU：allkeys-lru（所有 key 最近最少使用）/ volatile-lru（只淘汰有过期时间的）
LFU（Redis 4.0+）：allkeys-lfu（所有 key 中访问频率最低的）/ volatile-lfu（只淘汰有过期时间的）
Random：allkeys-random / volatile-random
TTL：volatile-ttl（只淘汰有过期时间的，优先淘汰剩余 TTL 最短的）
推荐：纯缓存场景用 allkeys-lfu；缓存与必须保留的数据混存时用 volatile-lru 或 volatile-lfu，并且只给缓存 key 设置 TTL

常见误区：

❌ 认为 Redis 的 LRU 是精确的最近最少使用 → ✅ Redis 的 LRU 是近似算法，默认采样 5 个 key 选最久未使用的淘汰，通过 maxmemory-samples 可调大采样数提高精度（但消耗更多 CPU）
❌ 所有场景都用 allkeys-lru → ✅ 如果 Redis 中既有缓存数据又有持久化数据（如配置信息），应用 volatile-lru 避免持久化数据被淘汰
❌ 设置了 maxmemory 但没选淘汰策略 → ✅ 默认策略是 noeviction，内存满后需要占用更多内存的写命令会直接返回 OOM 错误（读请求仍可执行），可能导致线上故障
❌ Assuming Redis LRU is a precise least-recently-used implementation → ✅ Redis uses an approximate LRU algorithm that samples 5 keys by default and evicts the least recently used; increasing maxmemory-samples improves accuracy at the cost of more CPU
❌ Using allkeys-lru for every scenario → ✅ If Redis holds both cached data and persistent data (e.g., configuration), use volatile-lru to prevent persistent data from being evicted
❌ Setting maxmemory without choosing an eviction policy → ✅ The default policy is noeviction: once memory is full, write commands that need more memory fail with an OOM error (reads still work), which can cause production outages
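
The sampling idea behind approximate LRU can be sketched in a few lines of plain Java. This is a simplified model, not Redis's implementation: it samples with replacement, keeps a logical clock instead of real timestamps, and omits the eviction pool that Redis maintains across rounds. `SampledLruCache` and its methods are hypothetical names, and `samples` plays the role of maxmemory-samples.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simplified model of sampled (approximate) LRU eviction.
class SampledLruCache {
    private final Map<String, Long> lastAccess = new HashMap<>(); // key -> logical access time
    private final List<String> keys = new ArrayList<>();          // allows O(1) random picks
    private final int capacity;
    private final int samples;
    private final Random rnd;
    private long clock = 0;

    SampledLruCache(int capacity, int samples, long seed) {
        this.capacity = capacity;
        this.samples = samples;
        this.rnd = new Random(seed);
    }

    // Read or write a key; inserting into a full cache evicts one key first.
    void touch(String key) {
        if (!lastAccess.containsKey(key)) {
            if (keys.size() >= capacity) {
                evictOne();
            }
            keys.add(key);
        }
        lastAccess.put(key, ++clock);
    }

    // Pick `samples` random keys and evict the least recently used among them.
    String evictOne() {
        int victim = rnd.nextInt(keys.size());
        for (int i = 1; i < samples; i++) {
            int idx = rnd.nextInt(keys.size());
            if (lastAccess.get(keys.get(idx)) < lastAccess.get(keys.get(victim))) {
                victim = idx;
            }
        }
        String evicted = keys.get(victim);
        int last = keys.size() - 1;
        keys.set(victim, keys.get(last)); // swap-remove keeps removal O(1)
        keys.remove(last);
        lastAccess.remove(evicted);
        return evicted;
    }

    boolean contains(String key) { return lastAccess.containsKey(key); }

    int size() { return keys.size(); }
}
```

With a small `samples` value the evicted key is only probably old. Raising it moves the behaviour closer to strict LRU, at the cost of more work per eviction.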

延伸追问：

Redis 的近似 LRU 算法和严格 LRU 相比，淘汰准确度差多少？
LFU 的计数器是怎么实现的？长时间不访问的 key 计数器会衰减吗？
如果 Redis 既做缓存又存了一些必须保留的数据，你会怎么配置淘汰策略？
How much less accurate is Redis's approximate LRU compared to strict LRU?
How is the LFU counter implemented? Does the counter decay for keys that haven't been accessed for a long time?
If Redis serves as both a cache and a store for must-keep data, how would you configure the eviction policy?
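
For the LFU counter question, the sketch below models the commonly described scheme: an 8-bit counter that starts at 5 for new keys, is incremented probabilistically so that higher values grow more slowly, and is reduced by one for every decay period the key has been idle. The parameters mirror lfu-log-factor and lfu-decay-time. This is a plain Java illustration with hypothetical names, written from the documented behaviour rather than copied from Redis source.

```java
import java.util.Random;

// Sketch of a logarithmic LFU counter with time-based decay.
class LfuCounter {
    static final int INIT_VAL = 5; // starting value for new keys
    static final int MAX = 255;    // counter fits in 8 bits

    // Increment with probability 1 / ((counter - INIT_VAL) * logFactor + 1),
    // so hot keys need many more hits to gain each additional point.
    static int increment(int counter, int logFactor, Random rnd) {
        if (counter >= MAX) {
            return MAX;
        }
        double base = Math.max(0, counter - INIT_VAL);
        double p = 1.0 / (base * logFactor + 1);
        return rnd.nextDouble() < p ? counter + 1 : counter;
    }

    // Subtract one point per elapsed decay period; a decay time of 0 disables decay.
    static int decay(int counter, long idleMinutes, long decayTimeMinutes) {
        if (decayTimeMinutes <= 0) {
            return counter;
        }
        long periods = idleMinutes / decayTimeMinutes;
        return (int) Math.max(0, counter - periods);
    }
}
```

In this model the counter is decayed first and then incremented on each access, so a key that was hot last week but idle since then loses its advantage over currently active keys.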

风控关联：

风控特征缓存建议 allkeys-lfu，热点数据自然保留，冷数据被淘汰，契合风控场景的访问模式
Risk control feature caching should use allkeys-lfu — hot data is naturally retained while cold data gets evicted, matching the access patterns of risk control scenarios
关联特征平台

English Answer：

Redis eviction is controlled by maxmemory and the eviction policy. The default policy is noeviction, which means Redis rejects writes when memory is full. This is safe for data that must not be evicted, but it can cause production write failures if Redis is being used as a cache and no policy is configured.
LRU policies evict keys that have not been used recently. allkeys-lru considers all keys, while volatile-lru only considers keys with an expiration time. Redis LRU is approximate, not exact: it samples a number of keys, five by default, and evicts the least recently used among the sample. Increasing maxmemory-samples improves accuracy but costs more CPU.
LFU policies, available since Redis 4.0, evict keys based on access frequency rather than recency. allkeys-lfu applies to all keys, while volatile-lfu applies only to keys with TTL. Random policies, such as allkeys-random and volatile-random, randomly evict keys from all keys or only expiring keys. volatile-ttl considers only keys with an expiration and evicts those with the shortest remaining TTL first.
For pure cache scenarios, allkeys-lfu is often a good default because hot data naturally stays in memory and cold data is evicted. If Redis stores both cache data and must-keep configuration or state, I would use volatile-lru or volatile-lfu and ensure only cache keys have TTLs. The important operational rule is to set both maxmemory and an explicit eviction policy, then monitor evictions, hit rate, and rejected writes.
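
One consequence of the rule above is easy to miss: a volatile-* policy can only evict keys that have a TTL, so if no such keys exist it behaves like noeviction and writes are rejected when memory is full. The sketch below models which keys each policy family may evict. It is plain Java with hypothetical names, and the key names in the usage are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Models which keys are eligible for eviction under a given policy name.
class EvictionCandidates {
    // ttlByKey maps key -> TTL in seconds; a null value means "no expiration".
    static List<String> candidates(String policy, Map<String, Long> ttlByKey) {
        List<String> out = new ArrayList<>();
        if (policy.equals("noeviction")) {
            return out; // nothing is ever evicted
        }
        boolean volatileOnly = policy.startsWith("volatile-");
        for (Map.Entry<String, Long> e : ttlByKey.entrySet()) {
            if (!volatileOnly || e.getValue() != null) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // With no eligible keys, a full instance has to reject memory-consuming writes.
    static boolean writeRejectedWhenFull(String policy, Map<String, Long> ttlByKey) {
        return candidates(policy, ttlByKey).isEmpty();
    }
}
```

For example, with `config:rules` stored without TTL and `cache:user:1` stored with a 60-second TTL, `volatile-lru` may evict only `cache:user:1`, while `allkeys-lfu` may evict both.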

关联

MySQL — 双写一致性是高频系统设计题
消息队列 — Redis Stream vs Kafka 的选型对比
分布式系统 — 分布式锁是两者交叉考点
特征平台 — 特征缓存的 Redis 架构设计