分布式系统

覆盖 CAP/BASE、分布式事务、限流、分布式锁、ID 生成、RPC、幂等、网关、熔断降级、分布式 Session、任务调度等高频考点。每道题包含中英双语答案、代码示例、常见误区和风控关联。

中英双语版本，适合英文面试准备。

相关页面：Redis · 消息队列 · MySQL · 实时风控引擎

Q1. CAP 定理是什么？BASE 理论又是什么？怎么在实际系统中取舍？

EN: Explain the CAP theorem and BASE theory. How do you make tradeoffs in real systems?

难度： ★★★★ | 出现频率： 极高（阿里、美团、字节、蚂蚁）

Key Terms: Consistency (一致性), Availability (可用性), Partition tolerance (分区容错), eventual consistency (最终一致性), soft state (软状态), basically available (基本可用)

答案要点：

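CAP theorem: a distributed system can guarantee at most two of Consistency (C), Availability (A), and Partition tolerance (P) at the same time. Because network partitions (P) cannot be ruled out, the real choice during a partition is CP (sacrifice availability) or AP (sacrifice strong consistency).
Typical CP systems:

- ZooKeeper (unavailable during leader election)
- etcd, HBase

Typical AP systems:

- Eureka (nodes do not guarantee consistent data among themselves)
- Cassandra, DNS

BASE theory: Basically Available (degraded or slower responses are allowed), Soft State (intermediate states are allowed), Eventual Consistency (replicas converge over time).
Practical tradeoff: financial scenarios lean CP (a transfer must not be wrong); internet-facing scenarios lean AP (brief inconsistency is acceptable).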

常见误区：

❌ 认为 CAP 可以同时满足三个 → ✅ 网络分区不可避免，只能在 C 和 A 之间二选一
❌ Believing CAP can be satisfied simultaneously → ✅ Network partitions are inevitable, so you must choose between C and A
❌ 认为 AP 系统完全不需要一致性 → ✅ AP 追求的是最终一致性，数据最终会收敛一致
❌ Thinking AP systems don't need consistency at all → ✅ AP systems pursue eventual consistency — data will converge over time
❌ 把 BASE 理论和 ACID 对立 → ✅ BASE 是对 CAP 在工程实践中的补充和折中，不是对立关系
❌ Treating BASE and ACID as opposites → ✅ BASE is a practical supplement and compromise to CAP, not an opposition to ACID

延伸追问：

ZooKeeper 是 CP 系统，那注册中心场景下 CP 好还是 AP 好？为什么？
在风控系统中，哪些数据必须强一致，哪些可以接受最终一致性？
分布式系统中网络分区真的不可避免吗？有没有办法减少分区发生的概率？
Is CP or AP better for a service registry? Why does the choice matter?
What data in a risk control system must be strongly consistent, and what can tolerate eventual consistency?

风控关联：

风控决策要求强一致性（不能漏过风险交易），但实时特征可以接受短暂不一致（AP）
Risk control decisions require strong consistency (no risky transactions should slip through), but real-time features can tolerate brief inconsistency (AP)
关联实时风控引擎

English Answer：

The CAP theorem says a distributed system can satisfy at most two of Consistency, Availability, and Partition tolerance at the same time. In real distributed systems, network partitions are considered unavoidable, so the practical tradeoff is usually between CP and AP. A CP system keeps consistency during a partition but may sacrifice availability. Typical examples are ZooKeeper, which may become temporarily unavailable during leader election, as well as Etcd and HBase. An AP system keeps serving requests during a partition but gives up strong consistency and relies on convergence later. Eureka, Cassandra, and DNS are typical AP-style examples.

BASE is an engineering philosophy for AP-oriented systems. It stands for Basically Available, meaning the system can still provide degraded or slower responses; Soft State, meaning intermediate states are allowed; and Eventual Consistency, meaning replicas do not have to be immediately consistent but should converge over time. In system design, financial and fund-related scenarios usually prefer CP because a transfer or risk decision cannot be wrong. Internet-facing scenarios often prefer AP because brief inconsistency is acceptable if availability is more important. In risk control, the final decision data should be strongly consistent, while some real-time features can tolerate short-lived inconsistency.

Q2. 分布式事务有哪些方案？各有什么优缺点？

EN: What are the common distributed transaction patterns? Compare their pros and cons.

难度： ★★★★★ | 出现频率： 极高（阿里、美团、字节，几乎所有公司必问）

Key Terms: 2PC (两阶段提交), TCC (Try-Confirm-Cancel), Saga (长事务编排), local message table (本地消息表), maximum effort notification (最大努力通知), Seata (阿里分布式事务框架), compensate (补偿)

答案要点：

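Pattern	How it works	Pros	Cons	Typical use
2PC	Coordinator runs prepare, then commit	Strong consistency	Synchronous blocking, coordinator is a single point of failure	Database layer
TCC	Try reserves resources, then Confirm or Cancel (three operations, two phases)	Flexible, no long-held database locks	Highly intrusive, compensation logic must be written	Funds / trading
Saga	Chain of forward operations plus a chain of compensating operations	Friendly to long-running transactions	No isolation, complex compensation	Business workflow orchestration
Local message table	Local transaction + message table + asynchronous delivery	Simple to implement	Message table grows and needs periodic cleanup	Eventual consistency
Maximum effort notification	Notifier keeps retrying until the receiver acknowledges	Simplest	Delivery is not guaranteed	Cross-system notification

Seata: Alibaba's open-source distributed transaction framework, supporting four modes: AT (automatic compensation), TCC, Saga, and XA. AT mode is the most commonly used: phase one acquires the global lock and commits the local transaction together with its undo log; phase two either deletes the undo log asynchronously (commit) or rolls back automatically from the undo log.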
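As a rough illustration of the TCC pattern and of the idempotency and empty-rollback concerns raised in the follow-up questions, here is a minimal in-memory sketch. The class `TccAccount`, its method names, and the `TxState` values are hypothetical and are not Seata APIs; a real participant would persist the transaction state in the same database transaction as the balance change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory account participant; not a Seata API.
class TccAccount {
    enum TxState { TRIED, CONFIRMED, CANCELLED }

    private long available;
    private long frozen;
    private final Map<String, TxState> states = new HashMap<>();
    private final Map<String, Long> amounts = new HashMap<>();

    TccAccount(long balance) { this.available = balance; }

    // Try: reserve funds. A repeated Try is a no-op; a Try arriving after Cancel is rejected.
    synchronized boolean tryFreeze(String txId, long amount) {
        TxState s = states.get(txId);
        if (s != null) return s == TxState.TRIED;
        if (available < amount) return false;
        available -= amount;
        frozen += amount;
        states.put(txId, TxState.TRIED);
        amounts.put(txId, amount);
        return true;
    }

    // Confirm: consume the reserved funds. Safe to retry.
    synchronized void confirm(String txId) {
        if (states.get(txId) != TxState.TRIED) return;
        frozen -= amounts.get(txId);
        states.put(txId, TxState.CONFIRMED);
    }

    // Cancel: release the reservation. Handles retries and empty rollback.
    synchronized void cancel(String txId) {
        TxState s = states.get(txId);
        if (s == null) {                 // empty rollback: Cancel arrived before any Try
            states.put(txId, TxState.CANCELLED);
            return;
        }
        if (s != TxState.TRIED) return;  // already confirmed or cancelled
        long amount = amounts.get(txId);
        frozen -= amount;
        available += amount;
        states.put(txId, TxState.CANCELLED);
    }

    synchronized long available() { return available; }
    synchronized long frozen() { return frozen; }
}
```

Recording a CANCELLED marker for an unknown transaction is one common way to stop a delayed Try from freezing funds that nobody will ever release.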

常见误区：

❌ 认为引入 Seata 就能解决所有分布式事务问题 → ✅ Seata 的 AT 模式依赖全局锁，高并发下性能瓶颈明显，需根据场景选型
❌ Assuming Seata solves all distributed transaction problems → ✅ Seata's AT mode relies on global locks, which become a clear bottleneck under high concurrency — choose based on your scenario
❌ 认为 Saga 能保证隔离性 → ✅ Saga 没有隔离性保证，中间状态对外可见，需要通过业务层面设计弥补
❌ Believing Saga guarantees isolation → ✅ Saga has no isolation guarantee — intermediate states are visible externally and must be handled at the business level
❌ 混淆 2PC 和 TCC → ✅ 2PC 是数据库层面的两阶段提交协议，TCC 是业务层面的资源预留模式
❌ Confusing 2PC with TCC → ✅ 2PC is a database-level two-phase commit protocol, while TCC is a business-level resource reservation pattern

延伸追问：

TCC 的 Cancel 阶段如果也失败了怎么办？空回滚和幂等怎么处理？
Seata AT 模式的全局锁机制是怎么实现的？和数据库行锁有什么区别？
在风控场景中，如果风控决策和交易扣款需要保证一致性，你会选哪种方案？
What happens if the Cancel phase in TCC also fails? How do you handle empty rollback and idempotency?
How does Seata AT mode implement global locks, and how do they differ from database row locks?

风控关联：

风控决策 + 交易处理涉及多个服务（风控引擎/订单/账户），用 TCC 保证资金安全
Risk control decisions and transaction processing span multiple services (risk engine, order, account); TCC ensures fund safety
关联实时风控引擎

English Answer：

Common distributed transaction patterns include 2PC, TCC, Saga, local message table, and maximum effort notification. 2PC uses a coordinator to run a prepare phase and then a commit phase. It provides strong consistency, but it is synchronously blocking and the coordinator can become a single point of failure, so it is mainly seen at the database layer. TCC splits business logic into Try, Confirm, and Cancel. It is flexible and avoids holding database locks for a long time, but it is highly intrusive because every business operation needs compensation logic. It is suitable for fund and transaction scenarios.

Saga is a chain of forward business operations plus compensating operations. It fits long-running business workflows, but it does not provide isolation; intermediate states are visible and compensation can become complex. The local message table pattern combines a local database transaction with a message table and asynchronous delivery. It is relatively simple and provides eventual consistency, but the message table can grow quickly and needs cleanup and retry handling. Maximum effort notification keeps retrying until the downstream system confirms, which is the simplest cross-system notification pattern, but it cannot guarantee success.

Seata is Alibaba's open-source distributed transaction framework. It supports AT, TCC, Saga, and XA modes. AT mode is the most commonly used: in the first phase each branch acquires the global lock and commits its local transaction together with an undo log; in the second phase a global commit only needs to delete the undo logs asynchronously, while a global rollback compensates automatically from them. In practice, I would choose TCC for fund safety, Saga for workflow orchestration, and local message tables for eventual consistency, and I would not assume Seata solves every scenario, because the global lock in AT mode can become a bottleneck under high concurrency.

Q3. 分布式 ID 生成方案有哪些？雪花算法时钟回拨怎么处理？

EN: What are the distributed ID generation strategies? How do you handle clock rollback in the Snowflake algorithm?

难度： ★★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: Snowflake (雪花算法), clock rollback (时钟回拨), worker ID (工作机器ID), sequence (序列号), UUID (通用唯一识别码), database auto-increment (数据库自增), Leaf (美团分布式ID框架)

答案要点：

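Snowflake: 64 bits = 1 sign bit + 41 timestamp bits + 10 worker ID bits + 12 sequence bits. Each worker can generate 4096 IDs per millisecond.
Handling clock rollback:

- Rollback < 5 ms: spin-wait until the clock catches up
- Rollback ≥ 5 ms: throw an exception and refuse to generate IDs, or keep using the historical maximum timestamp and borrow future time
- Baidu UidGenerator: pre-generates IDs into a RingBuffer, so it does not depend on the real-time clock
- Meituan Leaf: Leaf-snowflake mode uses ZooKeeper to manage worker IDs and raises an alert when rollback is detected

Comparison with other approaches:

- UUID: unordered, poor index efficiency, no business meaning
- Database segment (Leaf-segment): segment mode, IDs are fetched in batches
- Redis INCR: simple, but depends on Redis availability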
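The bit layout and the two rollback branches can be sketched as follows. This is a minimal illustration, not the Leaf or UidGenerator implementation: the class name `SnowflakeSketch`, the custom epoch (2024-01-01 UTC), the injected clock, and the 5 ms threshold are all assumptions chosen for the example.

```java
import java.util.function.LongSupplier;

// Minimal Snowflake-style generator; epoch and rollback threshold are illustrative.
class SnowflakeSketch {
    private static final long EPOCH_MS = 1704067200000L; // 2024-01-01T00:00:00Z
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_WORKER_ID = (1L << WORKER_BITS) - 1;   // 1023
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1; // 4095
    private static final long MAX_ROLLBACK_MS = 5;

    private final long workerId;
    private final LongSupplier clock; // injected so tests can control time
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeSketch(long workerId, LongSupplier clock) {
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.workerId = workerId;
        this.clock = clock;
    }

    synchronized long nextId() {
        long now = clock.getAsLong();
        if (now < lastTimestamp) {
            long rollback = lastTimestamp - now;
            if (rollback >= MAX_ROLLBACK_MS) {
                // Large rollback: refuse rather than risk duplicate IDs
                throw new IllegalStateException("clock moved backwards by " + rollback + " ms");
            }
            now = waitUntil(lastTimestamp); // small rollback: wait for the clock to catch up
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                now = waitUntil(lastTimestamp + 1); // 4096 IDs used up in this millisecond
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    private long waitUntil(long target) {
        long now = clock.getAsLong();
        while (now < target) {
            Thread.yield();
            now = clock.getAsLong();
        }
        return now;
    }
}
```

With 41 timestamp bits counted in milliseconds, the ID space lasts 2^41 ms, which is roughly 69 years from the chosen epoch.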

常见误区：

❌ 认为雪花算法生成的 ID 绝对有序 → ✅ 雪花 ID 只是趋势递增（单个节点内严格递增），不同机器之间无法保证严格有序
❌ Assuming Snowflake IDs are strictly ordered → ✅ Snowflake IDs are only trend-increasing (strictly increasing on a single worker); strict global ordering across machines is not guaranteed
❌ 忽视时钟回拨问题直接上线 → ✅ NTP 同步可能导致时钟回拨，必须设计兜底策略
❌ Ignoring clock rollback in production → ✅ NTP synchronization can move the clock backward; a fallback strategy must be designed before deployment
❌ 认为 worker ID 可以随便分配 → ✅ worker ID 必须全局唯一，需要借助 ZooKeeper/etcd 等协调服务管理
❌ Assuming worker IDs can be assigned arbitrarily → ✅ Worker IDs must be globally unique and managed via coordination services like ZooKeeper or etcd

延伸追问：

雪花算法的 41 bit 时间戳能用多少年？到期后怎么办？
Leaf-segment 号段模式和 Leaf-snowflake 模式各适合什么场景？
分布式 ID 如果需要包含业务含义（如商户号），应该怎么设计？
How long does the 41-bit timestamp in Snowflake last? What happens when it runs out?
When should you choose Leaf-segment over Leaf-snowflake?

风控关联：

风控系统中的交易流水号、决策记录 ID 均需要分布式唯一 ID，保证全局可追溯
Transaction serial numbers and decision record IDs in risk control systems require globally unique distributed IDs for full traceability
关联实时风控引擎

English Answer：

Snowflake generates a 64-bit ID. The common layout is 1 sign bit, 41 timestamp bits, 10 worker ID bits, and 12 sequence bits. With 12 sequence bits, one worker can generate 4096 IDs per millisecond. The IDs are trend-increasing, but they are not strictly globally ordered across all machines.

The biggest operational risk is clock rollback. If the clock moves backward by a very small amount, for example less than 5 ms, the generator can spin-wait until the clock catches up. If the rollback is larger, the safer choices are to reject ID generation, use the historical maximum timestamp, or borrow future time with strict safeguards. Baidu UidGenerator reduces real-time clock dependency by pre-generating IDs in a RingBuffer. Meituan Leaf has two modes: Leaf-segment obtains database ID segments in batches, while Leaf-snowflake uses ZooKeeper to manage worker IDs and alerts when clock rollback is detected.

Other options include UUID, database segment allocation, and Redis INCR. UUID is simple and globally unique, but it is unordered, hurts index locality, and usually has no business meaning. Database segments are efficient because IDs are allocated in batches, while Redis INCR is simple but depends on Redis availability. In a risk control system, transaction serial numbers and decision record IDs must be globally unique so that every decision can be traced and audited.

Q4. 限流算法有哪些？Sentinel 和 Guava RateLimiter 有什么区别？

EN: What are the common rate limiting algorithms? Compare Sentinel with Guava RateLimiter.

难度： ★★★★★ | 出现频率： 极高（阿里、美团、字节、蚂蚁）

Key Terms: fixed window (固定窗口), sliding window (滑动窗口), token bucket (令牌桶), leaky bucket (漏桶), Sentinel (阿里巴巴流量防护组件), Guava RateLimiter (Guava限流器), warmup (预热), circuit breaker (熔断器)

答案要点：

算法	原理	优点	缺点
固定窗口	固定时间段内计数	简单	窗口边界突发流量（临界问题）
滑动窗口	细粒度时间片滑动	平滑	内存开销
漏桶（Leaky Bucket）	固定速率流出	流量平滑	无法应对突发
令牌桶（Token Bucket）	固定速率放令牌，请求取令牌	允许合理突发	实现复杂

Guava RateLimiter：单机令牌桶，SmoothBursty（匀速）/ SmoothWarmingUp（预热）
Sentinel：阿里巴巴分布式限流框架，支持滑动窗口统计、热点参数限流、熔断降级、系统保护、集群限流
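To make the token bucket row concrete, here is a minimal single-node sketch. It is not how Guava RateLimiter is implemented internally; the class name `TokenBucket` and the choice to pass the current time in as a parameter (instead of reading `System.nanoTime()`) are assumptions made to keep the example deterministic.

```java
// Minimal token bucket: refill at a fixed rate, cap at capacity, spend one token per request.
class TokenBucket {
    private final long capacity;          // maximum burst size
    private final double tokensPerSecond; // steady refill rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double tokensPerSecond, long startNanos) {
        this.capacity = capacity;
        this.tokensPerSecond = tokensPerSecond;
        this.tokens = capacity;           // start full so an initial burst is allowed
        this.lastRefillNanos = startNanos;
    }

    synchronized boolean tryAcquire(long nowNanos) {
        // Lazily add the tokens earned since the last call, never exceeding capacity
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * tokensPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The capacity is the accumulation limit: it bounds how large a burst can be after an idle period, while the refill rate bounds the long-run average.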

代码示例：


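import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;

// Sentinel rate limiting example for a risk control API
@SentinelResource(value = "riskEvaluate", blockHandler = "riskBlockHandler")
public Decision evaluate(Transaction tx) {
    return doEvaluate(tx);
}

// Fallback when the call is blocked. The block handler must take the same
// parameters as evaluate(), plus a trailing BlockException, and return the same type.
public Decision riskBlockHandler(Transaction tx, BlockException ex) {
    return Decision.builder()
        .result("REVIEW")  // route to manual review while rate-limited
        .reason("系统繁忙，转人工")  // "system busy, routed to manual review"
        .build();
}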

常见误区：

❌ 认为固定窗口限流足够用 → ✅ 固定窗口存在临界突发问题，两个窗口交界处可能通过 2 倍流量
❌ Believing fixed-window rate limiting is sufficient → ✅ Fixed windows suffer from boundary burst — traffic can double at window edges
❌ 混淆漏桶和令牌桶 → ✅ 漏桶以固定速率处理请求（不允许突发），令牌桶允许短时间内消费积累的令牌（允许合理突发）
❌ Confusing leaky bucket with token bucket → ✅ Leaky bucket processes requests at a fixed rate (no bursts), while token bucket allows consuming accumulated tokens for controlled bursts
❌ 认为 Sentinel 只是限流工具 → ✅ Sentinel 是完整的流量防护组件，集限流、熔断降级、系统保护、热点限流于一体
❌ Treating Sentinel as just a rate limiter → ✅ Sentinel is a comprehensive traffic protection component integrating rate limiting, circuit breaking, degradation, system protection, and hot parameter limiting

延伸追问：

如果需要做集群维度的限流（比如全局限流 1000 QPS），应该怎么实现？
Sentinel 的热点参数限流底层是怎么实现的？滑动窗口的数据结构是什么？
令牌桶算法中令牌的积累上限怎么设置？和突发流量的关系是什么？
How would you implement cluster-level rate limiting (e.g., a global 1000 QPS cap)?
How does Sentinel's hot parameter rate limiting work internally? What data structure does the sliding window use?

风控关联：

风控接口必须做限流保护。Sentinel 的热点参数限流可以按商户/用户粒度限流，防止恶意请求打爆风控引擎
Risk control APIs must be rate-limited. Sentinel's hot parameter limiting enables per-merchant or per-user granular rate limiting, preventing malicious requests from overwhelming the risk engine
关联实时风控引擎

English Answer：

Common rate limiting algorithms include fixed window, sliding window, leaky bucket, and token bucket. A fixed window counts requests in a fixed time interval. It is simple, but it has the boundary burst problem: traffic near two adjacent window edges may double the expected limit. A sliding window divides time into smaller buckets and slides the window forward, which is smoother but costs more memory and computation. A leaky bucket drains requests at a fixed rate, so it smooths traffic well but cannot handle bursts. A token bucket adds tokens at a fixed rate, and each request must consume a token; it keeps the average rate stable while allowing controlled bursts.
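The boundary burst problem of the fixed window is easy to reproduce. The sketch below is a deliberately naive counter (class name `FixedWindowCounter` is hypothetical, and time is passed in as a parameter for determinism): with a limit of 100 per second, 100 requests at the last millisecond of one window and 100 more at the first millisecond of the next are all accepted.

```java
// Naive fixed-window counter, shown only to demonstrate the boundary burst.
class FixedWindowCounter {
    private final int limit;
    private final long windowMs;
    private long windowStart = -1L;
    private int count = 0;

    FixedWindowCounter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    synchronized boolean tryAcquire(long nowMs) {
        long start = nowMs - (nowMs % windowMs); // align to the window boundary
        if (start != windowStart) {              // a new window resets the counter
            windowStart = start;
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false;
    }
}
```

A sliding window avoids this by counting over the trailing interval instead of resetting at fixed boundaries.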

Guava RateLimiter is a single-node token bucket style limiter. It provides SmoothBursty for stable burst tolerance and SmoothWarmingUp for gradual warm-up. It is easy to use inside one JVM but does not solve distributed or cluster-level limiting by itself. Sentinel is Alibaba's traffic protection component. It supports sliding-window statistics, hot parameter rate limiting, circuit breaking, degradation, system protection, and cluster rate limiting. So Guava is suitable for local in-process limiting, while Sentinel is more suitable for microservice traffic protection. In risk control, Sentinel's hot parameter limiting can rate-limit by merchant, user, or other business dimensions to prevent malicious traffic from overwhelming the risk engine.

Q5. RPC 框架的原理是什么？和 HTTP 调用有什么区别？

EN: How does an RPC framework work? Compare it with HTTP-based calls.

难度： ★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: dynamic proxy (动态代理), serialization (序列化), service registry (服务注册), load balancing (负载均衡), Dubbo (阿里RPC框架), gRPC (Google RPC框架), protocol buffer (协议缓冲区)

答案要点：

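Core RPC flow: client stub (dynamic proxy) → serialization → network transport → server-side deserialization → invoke the implementation → return the result
RPC vs HTTP:

- RPC frameworks often use a custom binary protocol (such as the Dubbo protocol), which is lighter and faster
- HTTP is more universal (cross-language, cross-environment), and RESTful semantics are clear
- RPC frameworks ship with service discovery, load balancing, circuit breaking, and degradation; HTTP needs extra components

Recommendation: use RPC (Dubbo/gRPC) for internal microservice communication and HTTP for externally exposed APIs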
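The proxy, serialize, transport, deserialize, invoke flow can be shown end to end in one process. This is a toy sketch and not how Dubbo or gRPC work internally: `RiskService`, `RpcSketch`, and the `Function<byte[], byte[]>` standing in for the network are assumptions, and JDK serialization is used only because it needs no extra dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Function;

// Hypothetical service contract shared by client and server.
interface RiskService {
    String evaluate(String txId);
}

class RpcSketch {
    // What travels over the wire for one call.
    static class Request implements Serializable {
        final String method;
        final Class<?>[] paramTypes;
        final Object[] args;
        Request(String method, Class<?>[] paramTypes, Object[] args) {
            this.method = method;
            this.paramTypes = paramTypes;
            this.args = args;
        }
    }

    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Client side: a dynamic proxy turns each interface call into a serialized request.
    static <T> T client(Class<T> iface, Function<byte[], byte[]> transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            Request request = new Request(method.getName(), method.getParameterTypes(), args);
            return deserialize(transport.apply(serialize(request)));
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    // Server side: decode the request, invoke the real implementation, encode the result.
    static <T> Function<byte[], byte[]> server(Class<T> iface, T impl) {
        return data -> {
            try {
                Request request = (Request) deserialize(data);
                Method method = iface.getMethod(request.method, request.paramTypes);
                return serialize(method.invoke(impl, request.args));
            } catch (Exception e) {
                throw new IllegalStateException("RPC dispatch failed", e);
            }
        };
    }
}
```

A real framework replaces the in-process function with a network client and adds timeouts, retries, service discovery, and load balancing around the same core loop.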

常见误区：

❌ 认为 RPC 一定比 HTTP 快 → ✅ gRPC 底层用的也是 HTTP/2，性能差距在多数场景下不大，选型更应考虑生态和团队熟悉度
❌ Assuming RPC is always faster than HTTP → ✅ gRPC runs on HTTP/2 under the hood; the performance gap is negligible in most cases — ecosystem fit and team familiarity matter more
❌ 认为 RPC 和 REST 是对立关系 → ✅ RPC 是调用方式，REST 是架构风格，两者解决不同层面的问题
❌ Treating RPC and REST as mutually exclusive → ✅ RPC is an invocation paradigm, REST is an architectural style — they solve problems at different levels
❌ 忽视 RPC 的服务治理复杂度 → ✅ RPC 框架引入了注册中心、配置中心等额外依赖，运维成本更高
❌ Overlooking the operational overhead of RPC → ✅ RPC frameworks introduce additional dependencies like service registries and config centers, increasing operational complexity

延伸追问：

Dubbo 的负载均衡策略有哪些？默认是哪个？一致性哈希在什么场景下使用？
gRPC 和 Dubbo 各自的优劣势是什么？什么场景下选 gRPC？
RPC 框架如何实现服务熔断和降级？和 Hystrix/Resilience4j 有什么关系？
What load balancing strategies does Dubbo provide? Which one is the default, and when should you use consistent hashing?
How do RPC frameworks implement circuit breaking and degradation? How do they relate to Hystrix or Resilience4j?

风控关联：

风控引擎与各业务服务之间的内部调用通常使用 RPC（Dubbo/gRPC），降低延迟、提高吞吐
Internal calls between the risk engine and business services typically use RPC (Dubbo/gRPC) to reduce latency and increase throughput
关联实时风控引擎

English Answer：

The core flow of an RPC framework is: the client calls a local proxy, usually generated by a dynamic proxy; the framework serializes the request; sends it over the network; the server deserializes the request; invokes the real service implementation; and serializes the response back to the client. The goal is to make remote calls look like local method calls while still handling transport, serialization, timeout, retry, and service governance.

Compared with HTTP-based calls, RPC frameworks often use custom binary protocols, such as the Dubbo protocol, so the protocol can be lighter and faster in internal service-to-service communication. HTTP has better universality: it is easier for cross-language, cross-environment, and external integrations, and RESTful APIs have clearer resource semantics. RPC frameworks usually provide service discovery, load balancing, circuit breaking, and degradation out of the box. With plain HTTP, these capabilities normally require additional components such as a registry, gateway, client load balancer, or resilience library.

My rule of thumb is to use RPC frameworks such as Dubbo or gRPC for internal microservice communication and use HTTP for public or partner-facing APIs. In risk control systems, the risk engine often communicates with order, account, payment, and user services through RPC to reduce latency and improve throughput.

Q6. 如何设计一个幂等的接口？

EN: How do you design an idempotent API?

难度： ★★★★ | 出现频率： 极高（阿里、美团、字节）

Key Terms: idempotency (幂等性), idempotency key (幂等键), token mechanism (Token机制), unique constraint (唯一约束), optimistic lock (乐观锁), state machine (状态机)

答案要点：

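Definition of idempotency: performing the same operation once or many times has the same effect
Implementation options:

- Unique index: a database unique constraint prevents duplicate inserts
- Token mechanism: the server issues a token → the client sends it with the request → the server validates and deletes the token in one atomic step
- Optimistic lock: UPDATE ... SET version = version + 1 WHERE version = ?
- State machine: order status can only move forward (pending payment → paid → shipped)
- Dedup table: write the request ID to a dedup table and use it to detect repeats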

代码示例：


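// Idempotent risk decision example (decisionCache, redisLock, and doEvaluate are
// the surrounding service's collaborators; their exact APIs are not shown here)
public Decision evaluate(String requestId, Transaction tx) {
    // 1. Check the dedup store first
    Decision cached = decisionCache.get(requestId);
    if (cached != null) return cached;

    // 2. Take a distributed lock so concurrent duplicates do not both evaluate
    String lockKey = "risk:eval:" + requestId;
    boolean locked = redisLock.tryLock(lockKey, 30, TimeUnit.SECONDS);
    if (!locked) {
        throw new RiskException("请求处理中，请稍后重试");  // "request in progress, retry later"
    }
    try {
        // Double check: another caller may have finished while we waited
        cached = decisionCache.get(requestId);
        if (cached != null) return cached;

        Decision result = doEvaluate(tx);
        decisionCache.put(requestId, result);
        return result;
    } finally {
        // Reached only after a successful tryLock, so we never release a lock we do not hold
        redisLock.unlock(lockKey);
    }
}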

常见误区：

❌ 认为 GET 请求天然幂等就不需要处理 → ✅ 虽然 HTTP 语义上 GET 是幂等的，但业务层面仍需防止重复处理（如重复查询触发副作用）
❌ Assuming GET requests are inherently idempotent and need no handling → ✅ While GET is idempotent by HTTP semantics, the business layer still needs to guard against duplicate side effects
❌ 认为加分布式锁就等于幂等 → ✅ 分布式锁只解决并发问题，还需要结合去重表/唯一约束保证持久化层面的幂等
❌ Equating distributed locks with idempotency → ✅ Distributed locks only handle concurrency; dedup tables or unique constraints are needed for persistent-level idempotency
❌ 忽略幂等 key 的设计 → ✅ 幂等 key 必须全局唯一且有业务含义，不能简单用所有请求参数的哈希
❌ Overlooking idempotency key design → ✅ The idempotency key must be globally unique and carry business semantics — simply hashing all request parameters is error-prone

延伸追问：

幂等 key 应该怎么设计？如果用请求参数的哈希有什么风险？
分布式锁 + 去重表方案中，如果去重表写入成功但业务执行失败，怎么处理？
幂等接口的返回值应该怎么设计？第一次请求和重复请求的响应有什么区别？
How should you design an idempotency key? What are the risks of hashing request parameters?
In the distributed lock + dedup table approach, what happens if the dedup insert succeeds but the business logic fails?

风控关联：

风控决策接口必须幂等！同一笔交易重复请求不能产生不同决策
Risk control decision APIs must be idempotent — repeated requests for the same transaction must never produce different decisions
关联实时风控引擎

English Answer：

Idempotency means that executing the same operation once or multiple times produces the same business effect. It is essential for APIs that may be retried by clients, gateways, message queues, or job schedulers.

There are several common implementation patterns. First, a unique database constraint can prevent duplicate inserts, such as using a unique business order ID or request ID. Second, a token mechanism can be used: the server issues a token, the client submits the token with the request, and the server validates and deletes the token so it cannot be reused. Third, optimistic locking uses a version field, for example UPDATE ... SET version = version + 1 WHERE version = ?, to ensure only the expected version can be updated. Fourth, a state machine can restrict state transitions to one direction, such as pending payment to paid to shipped, so repeated operations cannot roll the state backward. Fifth, a deduplication table can store request IDs and cached results, allowing repeated requests to return the same result.
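The state machine pattern can be sketched with a compare-and-set, which plays the same role in memory that a conditional update (for example a WHERE clause on the current status) plays in a database. The class `OrderStateMachine` and its state names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// One-directional order state machine: PENDING_PAYMENT -> PAID -> SHIPPED.
class OrderStateMachine {
    enum State { PENDING_PAYMENT, PAID, SHIPPED }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.PENDING_PAYMENT);

    // Returns true only for the single call that actually performs the transition;
    // duplicates and out-of-order calls return false and change nothing.
    boolean pay() {
        return state.compareAndSet(State.PENDING_PAYMENT, State.PAID);
    }

    boolean ship() {
        return state.compareAndSet(State.PAID, State.SHIPPED);
    }

    State current() {
        return state.get();
    }
}
```

Because each transition names the state it expects, a retried payment callback cannot move a shipped order back to paid.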

In practice, a risk decision API often combines a request ID, a deduplication cache or table, and a distributed lock. The flow is to check whether the request has already been processed, acquire a lock for the request ID, double-check after acquiring the lock, execute the risk evaluation, store the decision, and return it. This handles both concurrent duplicate requests and retry scenarios. The idempotency key must be globally unique and should carry business meaning; simply hashing all request parameters can be risky because parameter order, optional fields, or non-business fields may change.

Q7. 微服务网关的作用？Spring Cloud Gateway 核心原理？

EN: What is the role of an API gateway? How does Spring Cloud Gateway work?

难度： ★★★ | 出现频率： 中高（美团、字节）

Key Terms: routing (路由转发), filter chain (过滤器链), predicate (断言/匹配条件), rate limiting (限流), authentication (认证鉴权), circuit breaker (熔断器), Netty (异步网络框架)

答案要点：

网关职责：路由转发、认证鉴权、限流熔断、日志监控、协议转换
Spring Cloud Gateway：基于 WebFlux + Netty，非阻塞异步模型。核心概念：Route（路由）= Predicate（匹配条件）+ Filter（过滤器链）
与风控集成：网关层做粗粒度风控（IP 黑名单、限流），业务层做细粒度风控（交易风险评估）

常见误区：

❌ 认为网关能替代所有安全防护 → ✅ 网关只做粗粒度防护（IP 黑名单、全局限流），细粒度的业务风控仍需在业务层实现
❌ Believing an API gateway can replace all security measures → ✅ Gateways only handle coarse-grained protection (IP blacklists, global rate limits); fine-grained business risk control must reside in the business layer
❌ 认为 Spring Cloud Gateway 和 Zuul 1.x 原理相同 → ✅ Zuul 1.x 是基于 Servlet 的同步阻塞模型，Gateway 基于 WebFlux + Netty 的异步非阻塞模型，性能差距显著
❌ Assuming Spring Cloud Gateway works the same as Zuul 1.x → ✅ Zuul 1.x uses a synchronous Servlet-based blocking model, while Gateway uses an async non-blocking WebFlux + Netty model with significantly better performance
❌ 把所有逻辑都塞到网关的 Filter 中 → ✅ 网关 Filter 应保持轻量，复杂业务逻辑应下沉到业务服务
❌ Stuffing all logic into gateway filters → ✅ Gateway filters should stay lightweight; complex business logic belongs in downstream services

延伸追问：

Spring Cloud Gateway 的 Filter 链的执行顺序是怎样的？Global Filter 和 Gateway Filter 的区别？
网关如何实现灰度发布？基于权重路由的原理是什么？
网关层的限流和业务层的限流分别适合做什么粒度的控制？
What is the execution order of Spring Cloud Gateway's filter chain? How do Global Filters differ from Gateway Filters?
How does an API gateway implement canary deployments? What is the principle behind weight-based routing?

风控关联：

网关层做粗粒度风控（IP 黑名单、全局限流、地域封禁），业务层做细粒度风控（交易风险评估、商户准入）
The gateway layer handles coarse-grained risk control (IP blacklists, global rate limiting, geo-blocking), while the business layer performs fine-grained assessment (transaction risk scoring, merchant onboarding)
关联实时风控引擎

English Answer：

An API gateway is the entry point of a microservice system. Its responsibilities include routing requests to downstream services, authentication and authorization, rate limiting and circuit breaking, logging and monitoring, and protocol conversion. It should handle cross-cutting concerns, but it should not contain heavy business logic.

Spring Cloud Gateway is built on Spring WebFlux and Netty, so it uses an asynchronous non-blocking model. Its core concept is Route = Predicate + Filter. A Route defines where a request should go. A Predicate decides whether a request matches that route, for example by path, header, method, or weight. Filters form a filter chain and perform logic before or after forwarding, such as adding headers, authentication, logging, rate limiting, or fallback handling.
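The Route = Predicate + Filter idea can be modeled in plain Java. The sketch below is not the Spring Cloud Gateway API (which is reactive and uses its own `GatewayFilter` and `RoutePredicateFactory` types); every class and field name here is an assumption chosen to show the control flow only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of Route = Predicate + Filter chain; not the Spring Cloud Gateway API.
class GatewaySketch {
    static class Request {
        final String path;
        final String clientIp;
        final Map<String, String> headers = new HashMap<>();
        Request(String path, String clientIp) {
            this.path = path;
            this.clientIp = clientIp;
        }
    }

    // A filter may short-circuit (return a response) or pass the request down the chain.
    interface Filter {
        String apply(Request req, Function<Request, String> next);
    }

    static class Route {
        final Predicate<Request> predicate;
        final List<Filter> filters;
        final Function<Request, String> target; // stands in for the downstream service
        Route(Predicate<Request> predicate, List<Filter> filters, Function<Request, String> target) {
            this.predicate = predicate;
            this.filters = filters;
            this.target = target;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    void addRoute(Route route) {
        routes.add(route);
    }

    // The first route whose predicate matches handles the request.
    String handle(Request req) {
        for (Route route : routes) {
            if (route.predicate.test(req)) {
                return runChain(route, 0, req);
            }
        }
        return "404";
    }

    private static String runChain(Route route, int index, Request req) {
        if (index == route.filters.size()) {
            return route.target.apply(req);
        }
        return route.filters.get(index).apply(req, next -> runChain(route, index + 1, next));
    }
}
```

An IP blacklist filter that returns early without calling `next` is the coarse-grained protection described in this section; everything that needs domain context stays behind `target`.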

For risk control integration, the gateway should perform coarse-grained protection such as IP blacklists, global rate limiting, and geo-blocking. Fine-grained risk control, such as transaction risk scoring and merchant onboarding checks, should stay in business services because it requires domain context. This layered design keeps the gateway lightweight while still protecting the whole system.

Q8. 什么是服务降级和熔断？它们的区别？

EN: What are service degradation and circuit breaking? How do they differ?

难度： ★★★★ | 出现频率： 高（阿里、美团、字节）

Key Terms: circuit breaker (熔断器), fallback (降级回退), half-open (半开状态), Hystrix/Resilience4j/Sentinel (熔断框架), slow call ratio (慢调用比例), error ratio (错误比例)

答案要点：

熔断（Circuit Breaking）：当下游服务异常率超过阈值时自动切断调用，避免级联故障。状态机：Closed → Open → Half-Open
降级（Degradation）：熔断或限流触发后的兜底逻辑（返回默认值、走缓存、转人工）
区别：熔断是机制（保护系统），降级是策略（保证可用性）
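The Closed → Open → Half-Open state machine can be sketched in a few lines. This is a simplified illustration, not the Sentinel or Resilience4j implementation: it trips on consecutive failures rather than an error ratio, lets every request through while half-open instead of a limited number of probes, and takes the current time as a parameter; the class name and thresholds are assumptions.

```java
// Simplified circuit breaker: trips after N consecutive failures, probes after a cool-down.
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Callers check this first; when it returns false they go straight to the fallback.
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: let probe traffic through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;        // a successful probe closes the breaker
    }

    synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;      // a failed probe or too many failures opens it again
            openedAt = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Degradation is whatever the caller does when `allowRequest` returns false, such as returning a cached decision or routing to manual review.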

常见误区：

❌ 混淆熔断和降级的概念 → ✅ 熔断是自动触发的保护机制（状态机驱动），降级是熔断/限流后的业务兜底策略
❌ Confusing circuit breaking with degradation → ✅ Circuit breaking is an automatically triggered protection mechanism (state-machine driven), while degradation is the business fallback strategy invoked after circuit breaking or rate limiting
❌ 认为熔断器一旦打开就永远不会关闭 → ✅ 熔断器有 Half-Open 状态，会定期放行探测请求，如果恢复则自动关闭
❌ Believing a circuit breaker, once open, stays open forever → ✅ Circuit breakers have a Half-Open state that periodically allows probe requests; if the downstream recovers, the breaker closes automatically
❌ 认为所有接口都需要熔断 → ✅ 核心链路和非核心链路策略不同，非核心服务可以直接降级而非熔断
❌ Assuming every endpoint needs circuit breaking → ✅ Core and non-core paths require different strategies — non-core services can be degraded directly without a full circuit breaker

延伸追问：

熔断器的三个状态（Closed/Open/Half-Open）的切换条件分别是什么？各项参数怎么配置？
Sentinel 和 Resilience4j 在熔断策略上有什么区别？各自适合什么场景？
熔断后的降级策略应该怎么设计？如果是核心交易接口，降级返回什么？
What are the state transition conditions for the three circuit breaker states (Closed/Open/Half-Open)? How should parameters be tuned?
How do you design a degradation strategy after circuit breaking? For a core transaction API, what should the fallback return?

风控关联：

风控引擎熔断后的降级策略：放行低风险交易 + 转人工审核高风险交易
The degradation strategy for the risk engine after circuit breaking: pass through low-risk transactions and route high-risk ones to manual review
关联实时风控引擎

English Answer：

Circuit breaking is a protection mechanism. When a downstream service's error ratio, slow call ratio, or failure count exceeds a configured threshold, the caller stops sending normal traffic to that service to prevent cascading failures. A circuit breaker usually has three states: Closed, Open, and Half-Open. Closed means normal calls are allowed. Open means calls are blocked or directly fall back. After a sleep window, the breaker enters Half-Open and allows a small number of probe requests. If the probes succeed, it closes again; if they fail, it opens again.

Degradation is the fallback strategy used after circuit breaking, rate limiting, or dependency failure. Common degradation strategies include returning a default value, reading cached data, using a simplified rule set, delaying processing, or routing the case to manual review. The core difference is that circuit breaking is the mechanism that protects the system, while degradation is the business strategy that preserves availability and user experience.

In risk control, the degradation strategy must be designed carefully. For example, after the risk engine is circuit-broken, low-risk transactions may be passed through based on cached or simplified rules, while high-risk transactions should be routed to manual review rather than blindly approved. Core transaction APIs should not return a fake success if doing so would create fund loss or compliance risk.

Q9. 分布式 Session 怎么实现？

EN: How do you manage sessions in a distributed environment?

难度： ★★★ | 出现频率： 中高（阿里、美团）

Key Terms: Spring Session (Spring会话管理), Redis (分布式缓存), JWT (JSON Web Token), sticky session (粘性会话), session replication (会话复制), token (令牌)

答案要点：

Redis 集中存储（推荐）：Spring Session + Redis，session 数据集中管理，所有节点共享
JWT 无状态：Token 自包含用户信息，不存服务端，适合微服务和移动端
粘性 Session：Nginx ip_hash，同一 IP 始终路由到同一节点（不推荐，无法水平扩展）
Session 复制：Tomcat 集群间同步 Session（性能差，不推荐）

常见误区：

❌ 认为 JWT 可以完全替代 Session → ✅ JWT 存在无法主动失效、Token 体积大、续期困难等问题，不适合所有场景
❌ Believing JWT can fully replace server-side sessions → ✅ JWT has drawbacks — tokens cannot be revoked easily, payloads grow large, and renewal is complex; it does not fit every scenario
❌ 认为粘性 Session 足够用 → ✅ 粘性 Session 在节点宕机时会丢失 Session，且不利于负载均衡
❌ Assuming sticky sessions are good enough → ✅ Sticky sessions lose data on node failure and hinder effective load balancing
❌ 把敏感信息直接放进 JWT payload → ✅ JWT payload 只是 Base64 编码而非加密，敏感信息应放在服务端 Session 中
❌ Putting sensitive data directly into the JWT payload → ✅ JWT payloads are merely Base64-encoded, not encrypted — sensitive information should stay server-side

延伸追问：

JWT 的续期方案有哪些？双 Token（Access Token + Refresh Token）机制是怎么工作的？
Spring Session + Redis 方案中，Session 过期和 Redis Key 过期是怎么同步的？
分布式 Session 在风控系统中有什么作用？如何结合风控做会话级风险评估？
What are the common JWT renewal strategies? How does the dual-token (Access Token + Refresh Token) mechanism work?
How does session expiry in Spring Session + Redis stay in sync with Redis key TTL?

风控关联：

风控系统中的会话风控（Session 风控）依赖分布式 Session，用于检测同一会话中的异常行为（如短时间内操作频率突变）
Session-level risk control relies on distributed sessions to detect anomalous behavior within a single session, such as sudden spikes in operation frequency
关联实时风控引擎

English Answer：

There are four common ways to handle sessions in a distributed environment. The recommended approach is centralized Redis storage, usually with Spring Session plus Redis. Session data is stored in Redis and shared by all application nodes, so the system can scale horizontally and a request can be routed to any node. JWT is a stateless approach: the token contains user claims and the server does not need to store session state. It works well for microservices and mobile clients, but token revocation, payload size, and renewal are harder to manage.

Sticky sessions route the same client, for example by Nginx ip_hash, to the same backend node. This is not recommended because a node failure can lose the session and it weakens load balancing. Session replication synchronizes sessions among Tomcat cluster nodes, but it has poor performance and does not scale well under high traffic.

In practice, Spring Session with Redis is usually the safest default for server-side web applications, while JWT is useful when stateless authentication is more important. Sensitive information should not be placed directly in a JWT payload because it is Base64-encoded, not encrypted. In risk control, distributed sessions support session-level risk assessment, such as detecting sudden operation-frequency spikes or abnormal behavior within the same session.
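A short demonstration of why the JWT payload is not a safe place for secrets: anyone holding the token can decode the middle segment without knowing the signing key. The token below is built by hand with a fake signature purely for illustration; `JwtPayloadDemo` and its helper names are hypothetical, and no JWT library is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Shows that a JWT payload is only Base64URL-encoded, not encrypted.
class JwtPayloadDemo {
    // Encode one JSON segment the way JWT does: Base64URL without padding.
    static String encodeSegment(String json) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    // Read the payload without any key; no signature verification happens here.
    static String readPayload(String token) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected header.payload.signature");
        }
        return new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
    }
}
```

The signature protects integrity only; confidentiality requires keeping the data server-side or encrypting the token.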

Q10. 如何实现一个分布式任务调度？

EN: How do you implement distributed task scheduling?

难度： ★★★★ | 出现频率： 中高（美团、字节）

Key Terms: XXL-Job (分布式任务调度平台), ElasticJob (弹性分布式任务框架), cron (定时表达式), sharding (分片), failover (故障转移), idempotency (幂等性)

答案要点：

XXL-Job：轻量级分布式任务调度平台，支持可视化管理、分片广播、失败重试、任务依赖
ElasticJob：当当开源，基于 ZooKeeper，弹性扩缩容
关键问题：任务幂等、分片策略、失败重试、任务超时、监控告警

常见误区：

❌ 认为 Quartz 集群模式就是分布式调度 → ✅ Quartz 集群通过数据库锁实现任务不重复执行，但没有分片能力，不适合大数据量并行处理
❌ Equating Quartz clustering with distributed scheduling → ✅ Quartz clusters use database locks to prevent duplicate execution but lack sharding, making them unsuitable for large-scale parallel processing
❌ 忽视任务的幂等性设计 → ✅ 分布式任务可能因重试、分片迁移等原因重复执行，必须保证幂等
❌ Overlooking task idempotency in design → ✅ Distributed tasks may execute multiple times due to retries or shard migration — idempotency is mandatory
❌ 认为定时任务只需要 cron 表达式就够了 → ✅ 分布式调度还需要考虑分片策略、失败转移、任务依赖、监控告警等工程问题
❌ Thinking cron expressions are all you need → ✅ Distributed scheduling also requires sharding strategies, failover, task dependencies, monitoring, and alerting

延伸追问：

XXL-Job 的分片广播模式是怎么工作的？分片参数怎么在业务代码中使用？
ElasticJob 的弹性扩缩容是怎么实现的？新节点加入后分片怎么重新分配？
分布式任务调度中，如何保证同一时刻只有一个节点执行某个任务？有哪些方案？
How does XXL-Job's sharding broadcast mode work? How do you use shard parameters in business code?
In distributed scheduling, how do you ensure only one node executes a given task at any moment? What are the available approaches?

风控关联：

风控模型日终批量（PSI 计算、模型评分更新、黑名单同步）用 XXL-Job 调度
Risk control end-of-day batch jobs (PSI calculation, model score updates, blacklist synchronization) are scheduled via XXL-Job
关联实时风控引擎

English Answer：

Distributed task scheduling can be implemented with a dedicated scheduling platform. XXL-Job is a lightweight distributed scheduler. It supports visual task management, sharding broadcast, failure retry, task dependencies, and operational monitoring. ElasticJob is an open-source framework from Dangdang. It is based on ZooKeeper and focuses on elastic scaling: when nodes are added or removed, task shards can be reassigned.

The key engineering problems are task idempotency, sharding strategy, failure retry, failover, timeout handling, monitoring, and alerting. Idempotency is especially important because a distributed task may run more than once due to retry, scheduler failover, or shard migration. Sharding decides how a large job is split across workers. Failure retry and timeout handling prevent silent data loss or permanently stuck jobs. Monitoring and alerting are needed because scheduled jobs often run outside the normal request path and failures may otherwise be noticed too late.
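Sharding usually comes down to each worker receiving a shard index and a shard total from the scheduler and selecting its own slice of the data. The sketch below assumes modulo sharding over numeric IDs; `ShardingSketch` and `itemsForShard` are hypothetical names, and how the index and total reach the handler depends on the scheduler in use.

```java
import java.util.ArrayList;
import java.util.List;

// Modulo sharding: each worker processes only the IDs that map to its shard index.
class ShardingSketch {
    static List<Long> itemsForShard(List<Long> ids, int shardIndex, int shardTotal) {
        if (shardTotal <= 0 || shardIndex < 0 || shardIndex >= shardTotal) {
            throw new IllegalArgumentException("invalid shard parameters");
        }
        List<Long> mine = new ArrayList<>();
        for (long id : ids) {
            // floorMod keeps the result non-negative even for negative IDs
            if (Math.floorMod(id, (long) shardTotal) == shardIndex) {
                mine.add(id);
            }
        }
        return mine;
    }
}
```

In a database-backed job the same filter is normally pushed into the query so that each worker reads only its own rows; the job must still be idempotent because shards can be reassigned and rerun.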

Quartz clustering can prevent duplicate execution through database locks, but it does not provide strong sharding capability, so it is not ideal for large-scale parallel processing. In risk control systems, XXL-Job is commonly used for end-of-day batch jobs such as PSI calculation, model score updates, and blacklist synchronization.

关联

Redis — 分布式锁和 Session 的核心组件
消息队列 — 最终一致性的关键基础设施
MySQL — 分布式事务的数据库层面保障
实时风控引擎 — 风控微服务架构的核心设计参考