风控技术架构题
面向风控系统开发岗位,覆盖实时风控引擎/规则引擎/特征平台/CEP/高可用等核心架构设计题。
每道题包含中英双语答案、代码示例、常见误区和风控关联。
相关页面: 业务风控场景题 | 风控模型策略题 | 实时风控引擎 | 特征平台
Q1. 设计一个实时风控决策引擎,要求 SLA < 50ms,日请求量 10 亿,怎么做?
EN: Design a real-time risk decision engine with SLA < 50ms and 1 billion daily requests.
难度: ★★★★★ | 出现频率: 极高(蚂蚁、美团、字节、京东)
Key Terms: decision engine (决策引擎), rule engine (规则引擎), feature service (特征服务), SLA (服务等级协议), real-time scoring (实时评分), hot path (热路径), cold path (冷路径)
答案要点:
请求生命周期:
请求 → 接入层(网关鉴权+限流)→ 预处理(参数校验+设备指纹)
→ 特征计算(实时特征+离线特征合并)→ 规则引擎(规则树/评分卡/模型)
→ 决策输出(PASS/REJECT/REVIEW)→ 后处理(日志+审计+通知)
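Key architecture decisions:
- Access layer: Nginx + Spring Cloud Gateway, IP blacklist + sliding-window rate limiting (Sentinel)
- Feature service: a feature platform such as Feast; Redis for real-time features (SLA < 5ms), HBase for offline features
- Rule engine: Drools or a custom Rete-based engine, with hot rule updates that need no restart
- Model service: low-latency XGBoost/ONNX Runtime inference, with models preloaded into memory
- Decision flow: rule tree (fast filtering) → scorecard (risk quantification) → ML model (deep evaluation), escalating layer by layer
- Asynchrony: audit logs, feature write-back, and notifications go through Kafka and never block the main path
- High availability: multi-datacenter deployment + degradation (fall back to lightweight static rules when the rule engine is unavailable)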
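The layered decision flow can be illustrated with a small sketch. This is only an illustration under stated assumptions: the class name, the feature keys `on_blacklist` and `on_whitelist`, and all thresholds are hypothetical, and the scorecard and model are passed in as plain functions rather than real Drools or ONNX Runtime calls. It assumes a recent JDK.

```java
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of the progressive flow: rule tree, then scorecard, then model. */
final class ProgressiveDecisionFlow {
    enum Decision { PASS, REJECT, REVIEW }

    // Illustrative thresholds only; real values come from strategy tuning.
    static final double SCORECARD_LOW = 0.2;
    static final double SCORECARD_HIGH = 0.8;
    static final double MODEL_REJECT = 0.9;
    static final double MODEL_REVIEW = 0.6;

    static Decision decide(Map<String, Double> features,
                           ToDoubleFunction<Map<String, Double>> scorecard,
                           ToDoubleFunction<Map<String, Double>> model) {
        // Stage 1: the rule tree settles obvious cases without any scoring.
        if (features.getOrDefault("on_blacklist", 0.0) > 0) return Decision.REJECT;
        if (features.getOrDefault("on_whitelist", 0.0) > 0) return Decision.PASS;

        // Stage 2: the scorecard quantifies risk; confident scores end the flow here.
        double score = scorecard.applyAsDouble(features);
        if (score < SCORECARD_LOW) return Decision.PASS;
        if (score > SCORECARD_HIGH) return Decision.REJECT;

        // Stage 3: only gray-zone traffic pays for model inference.
        double probability = model.applyAsDouble(features);
        if (probability >= MODEL_REJECT) return Decision.REJECT;
        return probability >= MODEL_REVIEW ? Decision.REVIEW : Decision.PASS;
    }
}
```

Because the model is only invoked in stage 3, the share of requests that reach it depends on how wide the scorecard gray zone is, which is the lever for trading model cost against coverage.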
常见误区:
- ❌ 所有逻辑同步串行执行,导致延迟叠加 → ✅ 特征并行获取 + 非关键路径异步化,关键路径只保留规则和模型推理
- ❌ Running all logic synchronously in series, causing latency accumulation → ✅ Parallel feature fetching + async non-critical paths; keep only rules and model inference on the critical path
- ❌ 每次请求都调用模型推理 → ✅ 规则树快速过滤明显正常/异常请求,仅灰色地带走模型,减少模型调用量
- ❌ Calling the ML model on every single request → ✅ Use a rule tree to quickly filter obvious normal/anomalous requests; only route borderline cases to the model
- ❌ 不做降级预案 → ✅ 分层降级策略(模型 → 规则 → 静态名单 → 兜底放行),确保极端情况下系统不雪崩
- ❌ No degradation contingency plan → ✅ Tiered fallback strategy (model → rules → static lists → default pass) to prevent cascading failures under extreme conditions
延伸追问:
- 如何保证特征服务本身的高可用?特征服务宕机时怎么处理?
- 决策引擎如何做到水平扩展?有状态部分(如滑动窗口)怎么处理?
- SLA < 50ms 的测量口径是什么?P99 还是 P999?
- How do you ensure the feature service itself is highly available? What happens when it goes down?
- How does the decision engine scale horizontally? How do you handle stateful components like sliding windows?
- What is the measurement basis for the SLA < 50ms: P99 or P999?
风控关联:
- 实时风控决策引擎是整个风控系统的核心枢纽,直接决定交易/登录/营销等场景的拦截效果和用户体验
- The real-time risk decision engine is the central hub of the entire risk control system, directly determining interception effectiveness and user experience across transaction, login, and marketing scenarios
- 关联 实时风控引擎
English Answer:
- I would design this as a layered real-time decision engine and keep the hot path extremely short. A request enters through the access layer, where Nginx and Spring Cloud Gateway handle authentication, protocol adaptation, IP blacklist checks, and sliding-window rate limiting through Sentinel. Then it goes through preprocessing, including parameter validation, device fingerprint extraction, and basic request normalization.
- The feature layer should return data within a few milliseconds. Real-time features such as recent transaction count or recent amount are stored in Redis with a target SLA below 5ms. Offline features such as user profile, historical behavior, and long-window statistics come from HBase or a feature store such as Feast. Feature fetching should be batched and parallelized.
- The decision layer combines rules, scorecards, and models. The rule engine can be Drools or a custom Rete-based engine, and it must support hot updates without restarting the service. XGBoost or ONNX Runtime models should be preloaded in memory to avoid cold-start latency. The decision flow should be progressive: rule tree for obvious pass/reject cases, scorecard for risk quantification, and ML models for deeper evaluation of gray-area traffic.
- Non-critical work must be asynchronous. Audit logs, feature write-back, notifications, and downstream analytics should go through Kafka rather than blocking the decision response. For high availability, the system should be deployed across multiple data centers and have tiered fallback: if the model is unavailable, use rules and scorecards; if the rule engine is unavailable, use static lists and lightweight default rules. The target is to keep the critical path under 50ms while preserving basic risk-control capability during failures.
Q2. 规则引擎怎么选型?Drools vs 自研?
EN: How do you choose a rule engine? Compare Drools with a custom-built solution.
难度: ★★★★★ | 出现频率: 极高(蚂蚁、美团、字节)
Key Terms: Rete algorithm (Rete 算法), Drools, rule DSL (规则领域语言), hot reload (热更新), performance (性能), maintainability (可维护性)
答案要点:
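| Dimension | Drools | Custom rule engine |
|---|---|---|
| Performance | Rete-based matching, suited to large rule sets | Can be tuned aggressively for the scenario |
| Learning curve | Must learn DRL syntax | Must design a DSL |
| Hot update | Supported (KieContainer/KieScanner in current versions; KnowledgeBase refresh in legacy Drools 5) | Must be built in-house |
| Flexibility | Constrained by DRL syntax | Fully controllable |
| Community and ecosystem | Mature, well documented | Needs a dedicated maintenance team |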
选型建议:
- 规则数 > 1000,复杂逻辑多 → Drools
- 规则数 < 500,性能要求极致 → 自研(基于决策树/规则链)
- 中间地带 → 自研轻量 DSL + Drools 混合
自研核心设计:
规则定义(JSON/YAML DSL)
↓
规则编译(DSL → 可执行逻辑)
↓
规则加载(热更新,版本管理)
↓
规则执行(责任链/决策树/Rete)
↓
规则监控(命中率/耗时/异常告警)
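As a rough illustration of the load, version, and execute stages of a custom engine, the sketch below keeps each published rule set as an immutable snapshot so that hot update and rollback are just reference swaps. All names here (`VersionedRuleEngine`, `Rule`, `RuleSet`) are hypothetical, rules are plain Java predicates rather than a compiled JSON/YAML DSL, and Java 17 is assumed for records and `Stream.toList()`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: versioned rule snapshots with hot update and rollback. */
final class VersionedRuleEngine {
    /** One compiled rule: a name plus an executable condition over the feature map. */
    record Rule(String name, Predicate<Map<String, Object>> condition) {}

    /** Immutable snapshot of a published rule set. */
    record RuleSet(int version, List<Rule> rules) {}

    private final Deque<RuleSet> history = new ArrayDeque<>();
    private volatile RuleSet active = new RuleSet(0, List.of());

    /** Hot update: publish a new snapshot; in-flight requests keep the one they already read. */
    synchronized void publish(List<Rule> rules) {
        history.push(active);
        active = new RuleSet(active.version() + 1, List.copyOf(rules));
    }

    /** Roll back to the previous snapshot; returns false when there is none. */
    synchronized boolean rollback() {
        if (history.isEmpty()) return false;
        active = history.pop();
        return true;
    }

    int activeVersion() { return active.version(); }

    /** Chain-style execution: returns the names of all rules that hit. */
    List<String> evaluate(Map<String, Object> features) {
        RuleSet snapshot = active; // read once so a request never mixes versions
        return snapshot.rules().stream()
                .filter(rule -> rule.condition().test(features))
                .map(Rule::name)
                .toList();
    }
}
```

A production engine would also persist the snapshots and record hit rate and latency per rule; those parts are left out to keep the sketch short.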
常见误区:
- ❌ 一开始就自研规则引擎 → ✅ 先用 Drools 验证业务,规则规模和复杂度明确后再考虑自研
- ❌ Building a custom rule engine from day one → ✅ Start with Drools to validate the business; consider building custom only after rule scale and complexity are well understood
- ❌ 规则热更新只考虑加载,不考虑回滚 → ✅ 每次规则发布保留版本快照,支持一键回滚到上一版本
- ❌ Only handling hot-reload without rollback capability → ✅ Keep version snapshots for every rule release and support one-click rollback to the previous version
延伸追问:
- 规则引擎的规则冲突怎么解决?优先级如何定义?
- 自研规则引擎的 DSL 如何保证业务人员也能理解和修改?
- How do you resolve rule conflicts in a rule engine? How is priority defined?
- How do you design a DSL that business analysts can understand and modify without engineering help?
风控关联:
- 规则引擎是风控系统的核心执行层,直接决定风控策略的迭代速度和灵活性
- The rule engine is the core execution layer of a risk control system, directly determining strategy iteration speed and operational flexibility
- 选型决策影响风控团队自主运营规则的能力,进而影响风控响应速度
- The build-vs-buy decision impacts how autonomously the risk team can manage rules, which in turn affects response time to emerging threats
- 关联 实时风控引擎
English Answer:
- I would choose based on rule volume, rule complexity, latency requirements, maintainability, and the team's ability to operate the engine. Drools is mature, has documentation and ecosystem support, and uses the Rete algorithm for efficient matching, so it is suitable when there are more than one thousand rules or many complex cross-condition rules. Its downside is DRL learning cost and limited flexibility once the business wants very customized execution semantics.
- A custom rule engine is better when the rule count is smaller, the execution path is very latency-sensitive, or the rules are simple enough to be expressed through a business-friendly DSL. The team must build the DSL, compiler, hot-loading mechanism, version management, rollback, monitoring, and conflict-resolution logic itself, so the maintenance cost is real.
- A practical custom design starts with JSON or YAML rule definitions, compiles the DSL into executable expressions or rule nodes, loads rules with version control and hot update, executes them through a responsibility chain, decision tree, or Rete-like matcher, and continuously monitors hit rate, latency, error rate, and abnormal rule behavior.
- My usual recommendation is phased. Use Drools or an existing engine to validate business logic early. If rule scale, latency bottlenecks, or operational needs become clear, evolve into a lightweight custom DSL or a hybrid model: custom DSL for simple high-frequency rules, Drools for complex expert rules.
Q3. 特征平台怎么设计?离线特征和实时特征怎么合并?
EN: How do you design a feature platform? How do you merge offline and real-time features?
难度: ★★★★★ | 出现频率: 极高(蚂蚁、美团、字节、同盾)
Key Terms: feature store (特征存储), online serving (在线服务), offline computation (离线计算), point-in-time correctness (时间点正确性), feature join (特征关联), Feast, T+1
答案要点:
架构分层:
- 离线特征层:Hadoop/Spark 批量计算(用户画像、历史统计),T+1 写入 HBase/Redis
- 实时特征层:Flink 流式计算(最近 N 分钟交易金额、频率),秒级写入 Redis
- 特征服务层:统一 API,根据请求从 Redis/HBase 获取特征并合并
- 特征注册中心:特征元数据管理(名称/类型/来源/SLA/所有者)
离线 + 实时合并策略:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public FeatureVector getFeatures(String userId, List<String> featureNames) {
    FeatureVector vector = new FeatureVector();
    // 1. Batch-fetch real-time features from Redis (SLA < 5ms)
    Map<String, Object> realtime = redisFeatureService.batchGet(userId, featureNames);
    // Always merge the real-time values, even when nothing is missing
    vector.merge(realtime);
    // 2. Backfill features missing from the real-time store with offline values from HBase
    List<String> missing = featureNames.stream()
        .filter(f -> !realtime.containsKey(f))
        .collect(Collectors.toList());
    if (!missing.isEmpty()) {
        vector.merge(hbaseFeatureService.batchGet(userId, missing));
    }
    return vector;
}
Point-in-Time Correctness:训练时用特征时间戳对齐,防止特征穿越(用未来数据训练)
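A point-in-time lookup can be sketched with a sorted map: for each training event, take the latest feature value that was already available at the event timestamp and never a later one. This is a minimal in-memory illustration; the class and method names are hypothetical, and a real feature store would do the same join in Spark or through its own historical-retrieval API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical sketch of an as-of lookup for one feature of one entity. */
final class PointInTimeFeatureLog {
    // Feature values keyed by the time they became available (epoch millis).
    private final TreeMap<Long, Double> valuesByTime = new TreeMap<>();

    void record(long availableAtMillis, double value) {
        valuesByTime.put(availableAtMillis, value);
    }

    /** Latest value known at or before the event time; never looks into the future. */
    Optional<Double> asOf(long eventTimeMillis) {
        Map.Entry<Long, Double> entry = valuesByTime.floorEntry(eventTimeMillis);
        return Optional.ofNullable(entry).map(Map.Entry::getValue);
    }
}
```

If a value is recorded at t=1000 and another at t=2000, a training event at t=1500 sees only the first one; joining on the latest value instead would leak the t=2000 value into the sample.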
常见误区:
- ❌ 离线训练和在线推理用不同的特征计算逻辑,导致训练-推理偏差(training-serving skew)→ ✅ 离线和在线共享同一套特征定义和计算逻辑,通过特征注册中心统一管理
- ❌ Using different feature computation logic for offline training and online inference, causing training-serving skew → ✅ Share the same feature definitions and computation logic across offline and online, managed through a centralized feature registry
- ❌ 不考虑 Point-in-Time Correctness,导致特征穿越 → ✅ 训练时按事件时间戳对齐特征快照,严格防止未来信息泄露
- ❌ Ignoring Point-in-Time Correctness, allowing feature leakage → ✅ Align feature snapshots by event timestamp during training to strictly prevent future information from leaking into the model
- ❌ 实时特征和离线特征串行获取 → ✅ 先并行批量获取实时特征,缺失部分再补充离线特征,减少延迟
- ❌ Fetching real-time and offline features sequentially → ✅ Batch-fetch real-time features in parallel first, then backfill missing ones from offline storage to minimize latency
延伸追问:
- 特征平台的特征血缘追踪怎么做?如何追踪一个特征从定义到上线到废弃的全生命周期?
- 特征覆盖率不足时怎么处理?有哪些特征填充(imputation)策略?
- How do you implement feature lineage tracking in a feature platform? How do you trace a feature's full lifecycle from definition to deployment to deprecation?
- What imputation strategies do you use when feature coverage is insufficient?
风控关联:
- 特征平台是风控模型和规则引擎的数据基础设施,特征质量直接决定风控效果
- The feature platform is the data backbone for risk models and rule engines — feature quality directly determines risk control effectiveness
- 实时特征用于捕捉即时风险行为(如短时间大额转账),离线特征提供用户历史基线
- Real-time features capture immediate risk behaviors (e.g., large transfers in a short window), while offline features provide a user's historical behavioral baseline
- 关联 特征平台
English Answer:
- I would design the feature platform in four layers. The offline feature layer uses Hadoop or Spark to compute user profiles, historical statistics, long-window behavior, and graph-derived features, then writes them to HBase, Redis, or an offline feature store on a T+1 schedule. The real-time feature layer uses Flink to compute short-window features such as transaction amount in the last N minutes, login failure count, or device-switch frequency, and writes them to Redis with second-level or sub-second freshness.
- The feature serving layer exposes a unified API. At request time, it batch-fetches real-time features from Redis first because they have the strictest latency target, usually below 5ms. If some features are missing or not real-time features, the service fetches offline values from HBase or the offline store and merges them into one feature vector. The service should support batch get, default values, timeout control, and feature-level degradation.
- The feature registry is the control plane. It records each feature's name, type, owner, source, refresh frequency, SLA, online/offline availability, default value, and deprecation status. It also prevents training-serving skew by making offline training and online inference use the same feature definitions.
- Point-in-Time Correctness is mandatory. During training, features must be joined according to the event timestamp, not the current latest value. Otherwise the model may learn from future information that was not available at decision time. This feature leakage can make offline metrics look good while online performance collapses.
- In production, I would also monitor feature coverage, missing rate, latency, freshness, and distribution drift. A feature platform is not only storage; it is the data contract between rules, models, and the decision engine.
Q4. Flink CEP 在风控中怎么用?举个例子?
EN: How do you use Flink CEP for risk control? Give an example.
难度: ★★★★☆ | 出现频率: 高(美团、字节、蚂蚁)
Key Terms: CEP (复杂事件处理), pattern detection (模式检测), complex event processing (复杂事件处理), event stream (事件流), window (窗口), Flink CEP
答案要点:
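- CEP (complex event processing) detects specific patterns in an event stream, for example "three failed logins by the same user within 5 minutes, followed by a login from a different location"
- Core Flink CEP APIs:
  - Pattern: defines the pattern (event sequence + conditions + time window)
  - PatternStream: applies the pattern to the event stream
  - select/flatSelect: processes the matched results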
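// Example: detect the "repeated login failures, then a new-device login" attack pattern.
// Assumes a recent Flink release that provides SimpleCondition.of, and a loginStream keyed by userId.
// where() takes a condition object, not a bare lambda, hence the SimpleCondition.of wrapper.
Pattern<LoginEvent, ?> attackPattern = Pattern.<LoginEvent>begin("start")
    .where(SimpleCondition.of(e -> !e.isSuccess()))
    .timesOrMore(3)                                  // three or more failures
    .followedBy("suspicious")
    .where(SimpleCondition.of(e -> e.isNewDevice())) // login from a new device
    // A pattern sequence takes a single time constraint (the smallest wins if several are set),
    // so the whole sequence is bounded once here instead of 5 minutes plus 10 minutes.
    .within(Time.minutes(10));
PatternStream<LoginEvent> patternStream = CEP.pattern(loginStream, attackPattern);
DataStream<AccountTakeoverAlert> alerts = patternStream.select(
    new PatternSelectFunction<LoginEvent, AccountTakeoverAlert>() {
        @Override
        public AccountTakeoverAlert select(Map<String, List<LoginEvent>> matches) {
            LoginEvent first = matches.get("start").get(0);
            // Emit the alert downstream instead of returning null
            return new AccountTakeoverAlert(first.getUserId());
        }
    });
// Deliver alerts through a sink rather than calling alertService inside select().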
常见误区:
- ❌ CEP 模式定义过于复杂,嵌套层级太多 → ✅ 每个模式聚焦单一风险场景,多个简单模式组合优于一个复杂模式
- ❌ Defining overly complex CEP patterns with too many nested layers → ✅ Keep each pattern focused on a single risk scenario; combining multiple simple patterns is better than one complex pattern
- ❌ 忽略 CEP 窗口过期事件的处理 → ✅ 合理设置窗口超时和侧输出(side output),处理部分匹配的超时事件
- ❌ Ignoring expired events in CEP windows → ✅ Set appropriate window timeouts and use side outputs to handle partially matched timed-out events
延伸追问:
- CEP 模式的状态后端怎么选?大状态下的性能怎么保证?
- CEP 和规则引擎在风控场景中怎么分工?哪些场景用 CEP,哪些用规则引擎?
- How do you choose the state backend for CEP patterns? How do you maintain performance with large state?
- How do CEP and rule engines divide responsibilities in risk control? When should you use CEP vs. a rule engine?
风控关联:
- Flink CEP 是风控系统中复杂行为模式检测的核心技术,适用于账户盗用检测、交易欺诈识别等需要跨事件关联的场景
- Flink CEP is the core technology for complex behavioral pattern detection in risk control, ideal for scenarios requiring cross-event correlation such as account takeover detection and transaction fraud identification
- CEP 检测到的模式匹配结果可直接触发风控告警或接入决策引擎进行二次评估
- CEP pattern match results can directly trigger risk alerts or feed into the decision engine for secondary evaluation
- 关联 实时风控引擎
English Answer:
- Flink CEP is used to detect ordered behavioral patterns in an event stream. In risk control, many risks cannot be identified from a single event; they require a sequence, such as several failed logins followed by a new-device login, or repeated small payments followed by a large transfer. CEP is suitable for this kind of cross-event correlation with time windows.
- The core APIs are Pattern, PatternStream, and select or flatSelect. Pattern defines the event sequence, conditions, repetition count, and window. PatternStream applies the pattern to a keyed event stream, for example keyed by user ID or device ID. select or flatSelect converts matched event sequences into alerts, risk labels, or features that can enter the decision engine.
- In the login example, the pattern is "three or more failed logins within five minutes, followed by a login from a new device within ten minutes." Once matched, the system sends an account-takeover alert or increases the user's risk score. The matched result can also trigger step-up verification rather than an immediate hard block.
- Production design must control state size and pattern complexity. Each pattern should focus on one risk scenario; combining several simple patterns is easier to maintain than one deeply nested pattern. Window timeout, late events, side outputs, and state backend choice must be designed carefully, otherwise CEP state can grow too large and hurt latency.
Q5. 风控系统怎么保证高可用?降级策略怎么设计?
EN: How do you ensure high availability for a risk control system? How do you design degradation strategies?
难度: ★★★★★ | 出现频率: 极高(蚂蚁、美团)
Key Terms: degradation (降级), circuit breaker (熔断器), fallback (兜底), chaos engineering (混沌工程), multi-datacenter (多机房), disaster recovery (容灾)
答案要点:
降级分层策略:
- L0(正常):全链路服务(实时特征 + 规则引擎 + ML 模型)
- L1(模型降级):ML 模型不可用 → 仅走规则引擎 + 评分卡
- L2(特征降级):实时特征不可用 → 用离线特征 + 默认值
- L3(规则降级):规则引擎不可用 → 静态黑名单 + 基本规则
- L4(兜底):全链路不可用 → 放行低风险 + 转人工
自动降级触发:
- 响应时间 > SLA 阈值(如 200ms)→ 跳过 ML 模型
- 错误率 > 1% → 降级到静态规则
- Kafka 积压 > 10 万条 → 启动批量消费者
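The tier selection can be sketched as a pure function from a health snapshot to a level. This is an assumption-heavy illustration: the `Health` fields, the rule that L4 means all three dependencies are down, and the mapping of the 200ms and 1% triggers onto L1 and L3 are one possible reading of the tiers above, not a prescribed policy. The Kafka backlog trigger is left out because it scales consumers rather than changing the decision tier. Java 17 is assumed for records.

```java
/** Hypothetical sketch: map dependency health and trigger thresholds to a degradation level. */
final class DegradationPolicy {
    enum Level { L0_FULL, L1_NO_MODEL, L2_OFFLINE_FEATURES, L3_STATIC_RULES, L4_FALLBACK }

    /** Health snapshot; field names are illustrative. */
    record Health(boolean modelUp, boolean realtimeFeaturesUp, boolean ruleEngineUp,
                  double p99LatencyMs, double errorRate) {}

    static final double LATENCY_THRESHOLD_MS = 200.0; // response time above threshold -> skip the model
    static final double ERROR_RATE_THRESHOLD = 0.01;  // error rate above 1% -> static rules

    static Level select(Health h) {
        // Check the most severe condition first so the deepest required level wins.
        if (!h.modelUp() && !h.realtimeFeaturesUp() && !h.ruleEngineUp()) return Level.L4_FALLBACK;
        if (!h.ruleEngineUp() || h.errorRate() > ERROR_RATE_THRESHOLD) return Level.L3_STATIC_RULES;
        if (!h.realtimeFeaturesUp()) return Level.L2_OFFLINE_FEATURES;
        if (!h.modelUp() || h.p99LatencyMs() > LATENCY_THRESHOLD_MS) return Level.L1_NO_MODEL;
        return Level.L0_FULL;
    }
}
```

Keeping the selection stateless makes it easy to drive automatic recovery with the same function: re-evaluate it on every health probe and restore traffic gradually once the level has stayed at L0 for a while.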
常见误区:
- ❌ 降级策略只在出问题时才测试 → ✅ 通过混沌工程(Chaos Engineering)定期演练降级,确保预案有效
- ❌ Only testing degradation strategies during actual incidents → ✅ Use chaos engineering to regularly drill degradation scenarios and validate contingency plans
- ❌ 降级是全有或全无的开关 → ✅ 分层逐级降级,每一层只牺牲必要的精度,最大化保留风控能力
- ❌ Treating degradation as an all-or-nothing switch → ✅ Tiered gradual degradation — each level only sacrifices necessary precision to maximize retained risk control capability
- ❌ 降级触发后忘记自动恢复 → ✅ 降级和恢复都需要自动化机制,持续探测上游服务健康状态
- ❌ Forgetting to auto-recover after degradation is triggered → ✅ Both degradation and recovery need automated mechanisms with continuous upstream health probing
延伸追问:
- 降级状态的监控和告警怎么做?如何确保运维团队第一时间感知到降级发生?
- 多机房部署时,跨机房切换的 RTO 和 RPO 目标怎么设定?
- How do you monitor and alert on degradation state? How do you ensure the ops team is immediately aware when degradation occurs?
- How do you set RTO and RPO targets for cross-datacenter failover?
风控关联:
- 风控系统的高可用直接关系到业务安全水位,降级期间的风险敞口需要有明确评估和补偿机制
- High availability of the risk control system directly impacts business security posture; risk exposure during degradation periods must be explicitly assessed and compensated
- 降级策略设计需权衡安全性和可用性,核心原则是"宁可误杀不可漏放"还是"宁可漏放不可误杀"取决于业务场景
- Degradation strategy design requires balancing security vs. availability — whether to prefer false positives or false negatives depends on the specific business scenario
- 关联 实时风控引擎
English Answer:
- I would guarantee high availability through redundant deployment, multi-datacenter disaster recovery, dependency isolation, and a tiered degradation strategy. In the normal L0 state, the full chain runs: real-time features, rule engine, scorecards, and ML models.
- At L1, if the ML service is unavailable or exceeds the latency threshold, the engine skips model inference and uses rules plus scorecards. At L2, if real-time features are unavailable, it uses offline features and safe default values. At L3, if the rule engine is unavailable, it falls back to static blacklists, whitelists, and basic rules. At L4, if the whole risk chain is unavailable, low-risk traffic may be passed while suspicious or high-risk traffic is routed to manual review or conservative rejection, depending on business risk appetite.
- Degradation should be triggered automatically by measurable signals: response time above SLA, error rate above threshold, dependency timeout, Redis or feature-service failure, Kafka lag above a threshold such as 100,000 messages, or abnormal model inference latency. Recovery should also be automatic but conservative, with health probing and gradual traffic restoration.
- Degradation is not an all-or-nothing switch. Each level should sacrifice only the minimum necessary precision while preserving as much risk-control capability as possible. The fallback path must be tested through chaos engineering drills; otherwise it is only a document, not a reliable mechanism.
Q6. 规则版本管理和 AB 实验怎么做?
EN: How do you manage rule versions and run A/B experiments?
难度: ★★★★☆ | 出现频率: 高(蚂蚁、美团、字节)
Key Terms: champion-challenger (冠军挑战者), canary release (金丝雀发布), traffic splitting (流量分桶), A/B testing (AB 实验), rule versioning (规则版本管理), shadow mode (影子模式)
答案要点:
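- Rule versioning: every rule carries a version and a status (draft/active/archived); publishing goes through an approval workflow
- A/B experiment (champion-challenger):
  - 90% of traffic goes to the Champion (current rule set)
  - 10% of traffic goes to the Challenger (new rule set)
  - Compare pass rate, catch rate, false positive rate, and other metrics
- Gray release: a new rule first runs in shadow mode (it blocks nothing and only records match results) and goes live once validated
- Traffic bucketing: bucket by userId hash so that the same user always hits the same rule set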
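Sticky bucketing and shadow mode are both small pieces of code. The sketch below is one possible shape, with assumptions stated: CRC32 is used as the stable hash (the text only says "userId hash"), the per-experiment salt is an addition so that different experiments do not share buckets, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;
import java.util.zip.CRC32;

/** Hypothetical sketch: sticky champion/challenger routing plus shadow-mode evaluation. */
final class ExperimentRouter {
    enum Group { CHAMPION, CHALLENGER }

    /** Stable bucket in [0, 100): the same userId and salt always map to the same bucket. */
    static int bucket(String userId, String experimentSalt) {
        CRC32 crc = new CRC32();
        crc.update((experimentSalt + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** Routes challengerPercent of the buckets to the challenger rule set. */
    static Group assign(String userId, String experimentSalt, int challengerPercent) {
        return bucket(userId, experimentSalt) < challengerPercent ? Group.CHALLENGER : Group.CHAMPION;
    }

    /** Shadow mode: a differing candidate decision is only logged; the live decision is returned. */
    static <D> D decideWithShadow(D liveDecision, D shadowDecision, BiConsumer<D, D> diffLogger) {
        if (!liveDecision.equals(shadowDecision)) diffLogger.accept(liveDecision, shadowDecision);
        return liveDecision;
    }
}
```

Raising `challengerPercent` from 10 to 20 keeps every user who was already in the challenger group there, because buckets below 10 are still below 20; that property makes gradual ramp-up comparable across stages.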
常见误区:
- ❌ 新规则直接全量上线,没有灰度验证 → ✅ 先 shadow mode 跑一段时间观察匹配结果,确认无误后再小流量上线
- ❌ Rolling out new rules to 100% traffic without canary validation → ✅ Run in shadow mode first to observe match results, then gradually increase traffic after validation
- ❌ AB 实验的分桶不固定,同一用户在不同时间走不同规则集 → ✅ 按 userId hash 分桶,确保实验期间用户始终走同一规则集,保证实验结果可比较
- ❌ Using non-sticky bucketing in A/B tests, causing the same user to hit different rule sets over time → ✅ Bucket by userId hash to ensure consistent rule set assignment throughout the experiment for comparable results
延伸追问:
- AB 实验的统计显著性怎么判断?样本量需要多大才能得出可靠结论?
- 如果 Challenger 规则集在线上出现误杀率飙升,怎么快速止损和回滚?
- How do you determine statistical significance in A/B experiments? What sample size is needed for reliable conclusions?
- If the challenger rule set shows a spike in false positives in production, how do you quickly stop the bleeding and roll back?
风控关联:
- 规则版本管理和 AB 实验是风控策略迭代的工程保障,直接决定策略上线的安全性和可回滚性
- Rule versioning and A/B testing provide the engineering safeguards for risk strategy iteration, directly determining the safety and rollback capability of strategy deployments
- shadow mode 是风控策略验证的最佳实践,可以在不影响线上用户的情况下评估新规则效果
- Shadow mode is the best practice for validating risk strategies — it evaluates new rule effectiveness without impacting live users
- 关联 实时风控引擎
English Answer:
- Rule version management starts with lifecycle control. Every rule should have a version, status, owner, effective time, expiration time, change reason, and rollback target. Typical statuses are draft, active, paused, and archived. Publishing should go through an approval workflow, and every release should keep a snapshot so the system can roll back quickly.
- Before a new rule affects users, I would run it in shadow mode. In shadow mode, the rule is evaluated and its hit result is logged, but it does not actually block or change the decision. This lets the team estimate hit rate, false positives, affected users, and expected business impact without harming online traffic.
- For A/B testing, I would use a champion-challenger framework. The champion is the current stable rule set; the challenger is the new rule set. A common split is 90% champion and 10% challenger at first, then gradually increase the challenger if metrics are healthy. Metrics include pass rate, fraud catch rate, false positive rate, manual review rate, user complaint rate, and net loss reduction.
- Traffic bucketing must be sticky. Bucketing by userId hash ensures the same user stays in the same rule group throughout the experiment, so the comparison is meaningful. If the challenger causes a spike in false positives or latency, the platform must support one-click rollback or immediate traffic cutback.
- A complete versioning platform should also support audit logs, rule conflict detection, priority management, gray release, and experiment reports. This is the engineering basis for safe risk-strategy iteration.
Q7. Java 技术栈在风控中有哪些典型应用?
EN: What are typical Java technology applications in risk control systems?
难度: ★★★★☆ | 出现频率: 高(蚂蚁、美团、字节)
Key Terms: Spring Boot, Kafka, Guava Cache (Guava 缓存), Caffeine, Disruptor (高性能内存队列), Netty, CompletableFuture (异步编排), ONNX Runtime
答案要点:
- Spring Boot 微服务:风控决策引擎、特征服务、规则管理平台
- Kafka Client:事件消费、决策结果推送
- Guava Cache / Caffeine:本地缓存热点规则和特征,减少 Redis 调用
- Disruptor:高性能内存队列,风控引擎内部事件流转
- Netty:风控接入层的高性能网络通信
- CompletableFuture:多特征源并行获取
- ONNX Runtime (Java):模型推理(Java 调用 ONNX Runtime 跑 XGBoost 模型)
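Parallel feature fetching with CompletableFuture might look like the sketch below. It assumes Java 9 or later for `completeOnTimeout`; the class name, the idea of a per-feature default map, and the single shared timeout are illustrative choices, and each source is represented by a plain `Supplier` rather than a real Redis or HBase client.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: fetch several feature sources concurrently with a latency budget. */
final class ParallelFeatureFetcher {
    /**
     * Runs every source concurrently. A source that fails or exceeds the budget
     * contributes its default value instead of blocking the whole request.
     */
    static Map<String, Object> fetchAll(Map<String, Supplier<Object>> sources,
                                        Map<String, Object> defaults,
                                        long timeoutMs,
                                        Executor executor) {
        Map<String, Object> result = new ConcurrentHashMap<>();
        CompletableFuture<?>[] futures = sources.entrySet().stream()
                .map(e -> CompletableFuture.supplyAsync(e.getValue(), executor)
                        .completeOnTimeout(defaults.get(e.getKey()), timeoutMs, TimeUnit.MILLISECONDS)
                        .exceptionally(ex -> defaults.get(e.getKey()))
                        // ConcurrentHashMap rejects nulls, so features without a default are skipped.
                        .thenAccept(v -> { if (v != null) result.put(e.getKey(), v); }))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join();
        return result;
    }
}
```

Total latency is then bounded by roughly the slowest source or the timeout, whichever is smaller, instead of the sum of all calls. A timed-out call keeps running on the executor, so the pool should be sized and isolated with that in mind.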
常见误区:
- ❌ 所有特征都从 Redis 获取,忽略本地缓存 → ✅ 热点规则和特征用 Caffeine 本地缓存,减少网络调用,降低延迟
- ❌ Fetching all features from Redis and ignoring local caching → ✅ Use Caffeine local cache for hot rules and features to reduce network calls and lower latency
- ❌ 特征获取串行调用,浪费等待时间 → ✅ 用 CompletableFuture 并行获取多个特征源,大幅降低总延迟
- ❌ Fetching features sequentially, wasting wait time → ✅ Use CompletableFuture to parallelize feature fetching from multiple sources, significantly reducing overall latency
延伸追问:
- Disruptor 和 Kafka 在风控引擎中分别适合什么场景?为什么风控引擎内部用 Disruptor 而不是 BlockingQueue?
- ONNX Runtime Java 版本的推理性能和 Python 版本相比有什么差异?怎么优化?
- When should you use Disruptor vs. Kafka in a risk engine? Why choose Disruptor over BlockingQueue for internal event processing?
- How does ONNX Runtime Java inference performance compare to the Python version? What optimization strategies exist?
风控关联:
- Java 是国内风控系统的主流技术栈,掌握 Java 在风控中的典型应用场景是面试加分项
- Java is the dominant technology stack for risk control systems in China; demonstrating knowledge of Java's typical applications in risk control is a strong interview differentiator
- 并发编程(CompletableFuture、Disruptor)和高性能网络通信(Netty)是风控引擎对低延迟要求的关键技术
- Concurrent programming (CompletableFuture, Disruptor) and high-performance networking (Netty) are key enablers for the low-latency requirements of risk decision engines
- 关联 并发编程
English Answer:
- Java is widely used in risk-control systems because the ecosystem fits low-latency backend services, event processing, and operational platforms. Spring Boot is commonly used for decision-engine services, feature services, rule-management platforms, and strategy-operation backends.
- Kafka clients are used for consuming transaction, login, marketing, and device events, and for publishing decision results, audit logs, and downstream risk labels. Redis clients and local cache libraries such as Caffeine are used together: Redis stores distributed features or counters, while Caffeine caches hot rules and hot features locally to reduce network calls and P99 latency.
- For internal high-throughput event processing, Disruptor can be used as an in-memory ring-buffer queue when the engine needs extremely low latency and predictable allocation behavior. Netty is suitable for high-performance gateway or RPC access layers where connection handling and network throughput matter.
- CompletableFuture is useful for parallel feature fetching. A decision engine often needs user profile, device fingerprint, blacklist status, and historical transaction features from different sources. Running those calls concurrently can reduce total latency compared with serial calls.
- For model inference, Java can call ONNX Runtime directly to run XGBoost, LightGBM-converted, or neural-network models inside the JVM. This avoids crossing a Python service boundary on the critical path. The trade-off is that model format, warmup, memory usage, and version rollout must be managed carefully.
- The main point is not to list frameworks, but to explain where each one belongs: Spring Boot for services, Kafka for event flow, Caffeine/Redis for feature access, Disruptor/Netty for low-latency infrastructure, CompletableFuture for parallel orchestration, and ONNX Runtime for in-process model inference.
Q8. 风控数据管道怎么设计?
EN: How do you design a risk control data pipeline?
难度: ★★★★☆ | 出现频率: 高(蚂蚁、美团、同盾)
Key Terms: data pipeline (数据管道), Kafka, Flink, Spark, HDFS, Iceberg, feature storage (特征存储), data quality (数据质量)
答案要点:
数据采集(埋点 SDK/Kafka)→ 实时计算(Flink)→ 特征存储(Redis/HBase)
↓
实时特征服务
↓
数据湖(HDFS/Iceberg)→ 离线计算(Spark)→ 模型训练 → 模型部署
- 实时管道:埋点 → Kafka → Flink(特征计算 + CEP)→ Redis(实时特征)→ 决策引擎
- 离线管道:数据湖 → Spark(T+1 特征计算、模型训练)→ HBase(离线特征)→ 模型仓库
- 数据质量:实时监控特征覆盖率、缺失率、异常值
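The data-quality check can be as simple as computing a missing rate per feature over a recent batch and comparing it against a threshold. The sketch below is a minimal in-memory version with hypothetical names; a real pipeline would compute the same metric in Flink or Spark and push it to an alerting system. Java 17 is assumed for `Stream.toList()`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-feature missing rate and threshold breaches. */
final class FeatureQualityMonitor {
    /** Missing rate per feature over a batch of rows; a null or absent value counts as missing. */
    static Map<String, Double> missingRates(List<Map<String, Object>> rows, List<String> featureNames) {
        Map<String, Double> rates = new LinkedHashMap<>();
        for (String name : featureNames) {
            long missing = rows.stream().filter(row -> row.get(name) == null).count();
            rates.put(name, rows.isEmpty() ? 0.0 : (double) missing / rows.size());
        }
        return rates;
    }

    /** Names of features whose missing rate exceeds the alert threshold. */
    static List<String> breaches(Map<String, Double> rates, double threshold) {
        return rates.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

Coverage is simply one minus the missing rate, so the same numbers serve both metrics named above.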
常见误区:
- ❌ 实时管道和离线管道各自独立,特征口径不一致 → ✅ 统一特征定义和计算逻辑,通过特征注册中心确保实时和离线特征口径一致
- ❌ Running real-time and offline pipelines independently with inconsistent feature definitions → ✅ Unify feature definitions and computation logic; use a feature registry to ensure online/offline feature consistency
- ❌ 忽略数据质量监控,上游数据异常直接流入特征 → ✅ 实时监控特征覆盖率、缺失率和异常值,设置告警阈值
- ❌ Ignoring data quality monitoring, allowing upstream anomalies to flow directly into features → ✅ Monitor feature coverage, missing rates, and anomalies in real-time with alerting thresholds
延伸追问:
- 数据管道的延迟怎么监控?从埋点采集到特征可用的端到端延迟怎么测量?
- 数据湖选型时,HDFS、Iceberg、Hudi 各有什么优劣?风控场景推荐哪个?
- How do you monitor data pipeline latency? How do you measure end-to-end latency from event ingestion to feature availability?
- When choosing a data lake, what are the trade-offs between HDFS, Iceberg, and Hudi? Which is recommended for risk control scenarios?
风控关联:
- 数据管道是风控系统的数据基座,数据质量直接决定特征质量和模型效果
- The data pipeline is the data foundation of a risk control system; data quality directly determines feature quality and model performance
- 实时管道的延迟直接影响风控系统的实时性,离线管道的准确性影响模型训练效果
- Real-time pipeline latency directly impacts the system's real-time responsiveness, while offline pipeline accuracy affects model training effectiveness
- 关联 特征平台
English Answer:
- I would design the risk-control data pipeline as two connected tracks: real-time and offline. The real-time track starts from SDK instrumentation, server logs, or business events, writes them to Kafka, then uses Flink for real-time feature computation and CEP pattern detection. The output is written to Redis or another online feature store so the decision engine can read features within a few milliseconds.
- The offline track starts from the data lake, such as HDFS, Iceberg, or Hudi. Spark computes T+1 historical features, builds training datasets, and supports model training. Offline features are written to HBase or an offline feature store, while trained models are registered in a model repository and deployed to the online inference layer.
- The two tracks must share feature definitions. If the real-time and offline pipelines independently implement the same feature, training-serving skew will appear. A feature registry should define feature name, owner, computation logic, data source, refresh frequency, default value, and online/offline serving mode.
- Data quality monitoring is part of the pipeline design. I would monitor feature coverage, missing rate, abnormal values, distribution drift, event-time delay, Kafka lag, Flink checkpoint health, and end-to-end latency from ingestion to feature availability. Bad upstream data should trigger alerts and, when necessary, feature-level degradation.
- For risk control, pipeline latency determines whether the system can react to fast attacks, while offline accuracy determines model quality. The pipeline should therefore support both fast online reaction and stable offline learning.
Q9. 如何用图分析(Graph Analytics)揭露欺诈网络?请描述技术方案。
EN: How would you use graph analytics to uncover fraud networks? Describe the technical approach.
难度: ★★★★★ | 出现频率: 高(蚂蚁、字节、美团、Grab)
Key Terms: graph analytics (图分析), Neo4j, community detection (社区发现), centrality (中心性), device fingerprint clustering (设备指纹聚类)
答案要点:
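- Graph modeling: nodes = users/devices/IPs/phone numbers/bank cards/shipping addresses; edges = relationships (same-device login, same-IP transaction, shared shipping address, fund transfer)
- Fraud-ring detection techniques:
  - Community detection (Louvain / Label Propagation): find tightly connected subgraphs
  - Centrality analysis (Degree / Betweenness / PageRank): locate the core nodes of a ring
  - Graph neural networks (GCN / GraphSAGE): learn fraud-probability representations of nodes
- Technology choices:
  - Storage: Neo4j (small scale) / JanusGraph + HBase (large scale) / NebulaGraph (a China-developed alternative)
  - Computation: Spark GraphX (offline) / Flink-based graph processing for near-real-time updates (Flink Gelly itself is a batch-oriented library)
  - Query: Cypher (Neo4j) / Gremlin (JanusGraph)
- Practical cases:
  - Batch-registration detection: same IP range + same device + registration time window → graph clustering → flag the ring
  - Fake-order detection: buyer-seller-product triangles → abnormally dense subgraphs → scoring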
代码示例:
// Neo4j Cypher: find fraud rings that share devices
MATCH (u1:User)-[:USED_DEVICE]->(d:Device)<-[:USED_DEVICE]-(u2:User)
WHERE u1 <> u2 AND u1.risk_score > 0.7
WITH u1, u2, collect(DISTINCT d.device_id) AS shared_devices
WHERE size(shared_devices) >= 2
RETURN u1.user_id, u2.user_id, shared_devices
ORDER BY size(shared_devices) DESC
LIMIT 50
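Outside the graph database, the simplest grouping step is connected components over user-device edges. The sketch below uses union-find for that; note that this is a deliberately simpler technique than Louvain or Label Propagation (it finds every connected group, not dense communities), and the class name and the `user:`/`device:` key prefixes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: group users connected through shared devices with union-find. */
final class SharedDeviceClusters {
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        // Path compression: point every node on the walk straight at the root.
        while (!parent.get(x).equals(root)) {
            String next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    /** Adds a USED_DEVICE edge; users sharing a device end up in the same component. */
    void link(String userId, String deviceId) {
        String a = find("user:" + userId);
        String b = find("device:" + deviceId);
        if (!a.equals(b)) parent.put(a, b);
    }

    /** User groups with at least minUsers members, each sorted for stable output. */
    Collection<TreeSet<String>> rings(int minUsers) {
        Map<String, TreeSet<String>> byRoot = new HashMap<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            if (node.startsWith("user:")) {
                byRoot.computeIfAbsent(find(node), k -> new TreeSet<>()).add(node.substring(5));
            }
        }
        byRoot.values().removeIf(group -> group.size() < minUsers);
        return byRoot.values();
    }
}
```

Connected components tend to over-merge when a public device or shared IP links unrelated users, which is one reason the community detection and centrality steps listed above are still needed on top.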
常见误区:
- ❌ 认为图分析只是"画画线" → ✅ 图分析的核心价值在于通过社区发现和中心性计算识别隐含的团伙关系
- ❌ Thinking graph analytics is just "drawing lines" → ✅ Its core value is identifying hidden fraud-ring relationships through community detection and centrality computation
- ❌ 认为图数据库可以替代所有风控场景下的关系型数据库 → ✅ 图数据库是补充能力:关系密集型分析用图,交易型查询仍用 RDBMS
- ❌ Thinking graph databases replace relational databases for all risk scenarios → ✅ Graph is complementary: use it for relationship-heavy analysis, RDBMS for transactional queries
延伸追问:
- 图分析在实时场景下的延迟如何控制?
- 图神经网络的训练样本如何构造?标注成本如何?
- How do you control graph analytics latency in real-time scenarios?
- How do you construct training samples for Graph Neural Networks? How do you manage labeling cost?
- How do you handle graph visualization for analysts when the fraud network has 10,000+ nodes?
风控关联:
- 图分析是 反欺诈体系 的核心能力之一
- Graph analytics is one of the core capabilities of an anti-fraud system
- 与设备指纹、关联图谱技术直接相关
- It is directly related to device fingerprinting and relationship-graph technologies
English Answer:
- Graph analytics is used to uncover fraud rings that are hard to detect from single transactions. I would model users, devices, IPs, phone numbers, bank cards, shipping addresses, merchants, and orders as nodes. Edges represent relationships such as same-device login, same-IP transaction, shared address, shared payment instrument, fund transfer, or buyer-seller interaction.
- On top of the graph, I would use community detection algorithms such as Louvain or Label Propagation to find tightly connected subgraphs. I would use centrality metrics such as degree, betweenness, and PageRank to identify hub accounts, organizers, or bridge nodes. For more advanced detection, GCN or GraphSAGE can learn node embeddings and fraud probability representations from graph structure and labels.
- Technology choice depends on scale and query pattern. Neo4j is convenient for smaller graphs and interactive Cypher queries. JanusGraph with HBase or Cassandra is more suitable for larger graphs. NebulaGraph can be considered when a distributed graph database is needed. Spark GraphX is suitable for offline graph computation, while Flink-based graph processing can support near-real-time updates for selected features.
- Practical scenarios include batch registration detection, where many users share the same IP range, device, and registration window, and fake-order detection, where buyer-seller-product triangles form unusually dense subgraphs. The output can become graph features, risk labels, or analyst investigation leads.
- Graph analytics complements, rather than replaces, relational databases. It should be used for relationship-heavy analysis, while transactional reads and writes still belong in MySQL or other OLTP stores.
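The batch-registration scenario can be approximated without a graph engine as a first pass. The sketch below buckets registrations by IPv4 /24 prefix and device, then looks for the densest time window in each bucket. The function name, the tuple layout of `events`, and the default thresholds are assumptions for illustration only.

```python
from collections import defaultdict


def detect_batch_registrations(events, window_seconds=600, min_accounts=3):
    """Flag (ip_prefix, device) buckets with many registrations in a short window.

    events: iterable of (user_id, ipv4, device_id, unix_ts) tuples (hypothetical shape).
    """
    buckets = defaultdict(list)
    for user_id, ip, device_id, ts in events:
        ip_prefix = ".".join(ip.split(".")[:3])  # /24 prefix, IPv4 only
        buckets[(ip_prefix, device_id)].append((ts, user_id))

    suspicious = []
    for (ip_prefix, device_id), regs in buckets.items():
        regs.sort()
        start = 0
        best = []
        # Two-pointer scan: largest set of registrations within window_seconds
        for end in range(len(regs)):
            while regs[end][0] - regs[start][0] > window_seconds:
                start += 1
            if end - start + 1 > len(best):
                best = [user for _, user in regs[start:end + 1]]
        if len(best) >= min_accounts:
            suspicious.append(
                {"ip_prefix": ip_prefix, "device_id": device_id, "users": best}
            )
    return suspicious
```

Flagged buckets would then feed the graph clustering step or an analyst queue rather than trigger a block on their own.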
Q10. 设计百万级 TPS 的实时欺诈评分系统
EN: Design a real-time fraud scoring system handling millions of transactions per second.
难度: ★★★★★ | 出现频率: 极高(蚂蚁、美团、字节、Shopee)
Key Terms: TPS (每秒交易量), scoring pipeline (评分管道), rule layer / ML layer / deep analysis layer, feature store (特征存储)
答案要点:
- 三层评分架构:
  - 规则层(0-50ms):黑白名单、频率控制、基础规则匹配,拦截 60-70% 明显欺诈
  - ML 层(50-200ms):XGBoost/LightGBM 轻量模型,使用预计算特征 + 实时特征,处理 20-30% 灰色地带
  - 深度分析层(200ms-2s):图分析、深度模型、序列模型,异步执行,结果用于后续拦截
- 高性能特征服务:
  - Feature Store:离线特征(T+1 Spark)+ 准实时特征(Flink 5min 窗口)+ 实时特征(Redis 滑动窗口)
  - 特征预加载:交易前预热用户特征到本地缓存
- 流量控制与降级:
  - Sentinel 限流 + 熔断
  - 非核心规则动态开关(Apollo/Nacos 配置中心)
  - 大促模式:关闭非关键 ML 模型,纯规则拦截
- 容量规划:
  - 100万 TPS ≈ 每秒 100万次评分请求
  - 规则引擎单机 10万 QPS → 需要 10+ 节点集群
  - Redis Cluster 特征查询 < 1ms → 横向扩展
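The three-layer flow above can be sketched as a small router. This is a toy illustration under stated assumptions: `TieredScorer`, its thresholds, the `txn_count_1m` feature name, and the default-pass behavior in degraded mode are all hypothetical; `ml_enabled` stands in for a config-center switch and `deep_queue` for an asynchronous message queue.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Decision:
    action: str                    # "PASS", "REJECT", or "REVIEW"
    layer: str                     # layer that produced the decision
    score: Optional[float] = None  # model probability, if the ML layer ran


@dataclass
class TieredScorer:
    blacklist: set
    whitelist: set
    model: Callable[[dict], float]  # hypothetical model: txn features -> fraud probability
    deep_queue: list = field(default_factory=list)  # stand-in for an async queue
    ml_enabled: bool = True         # stand-in for a dynamic config switch
    reject_above: float = 0.9
    review_above: float = 0.6

    def score(self, txn: dict) -> Decision:
        # Layer 1: cheap rules decide the obvious cases
        user = txn["user_id"]
        if user in self.blacklist:
            return Decision("REJECT", "rule")
        if user in self.whitelist:
            return Decision("PASS", "rule")
        if txn.get("txn_count_1m", 0) > 30:  # simple frequency rule
            return Decision("REJECT", "rule")

        # Degraded (e.g. promotion) mode: skip the model, rules only
        if not self.ml_enabled:
            return Decision("PASS", "rule-degraded")

        # Layer 2: lightweight model for gray-area traffic
        p = self.model(txn)
        if p >= self.review_above:
            # Layer 3: hand off to asynchronous deep analysis, off the hot path
            self.deep_queue.append(txn["txn_id"])
        if p >= self.reject_above:
            return Decision("REJECT", "ml", p)
        if p >= self.review_above:
            return Decision("REVIEW", "ml", p)
        return Decision("PASS", "ml", p)
```

Whether degraded mode should default to PASS or REVIEW is a business decision; the sketch passes only to keep the fallback path visible.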
常见误区:
- ❌ 追求所有交易都走深度模型 → ✅ 三层架构分级处理,80%+ 交易在规则层完成决策
- ❌ Trying to send every transaction through a deep model → ✅ Use the three-layer architecture for tiered processing, so 80%+ of transactions can be decided at the rule layer
- ❌ 试图用一个统一模型覆盖所有场景 → ✅ 针对支付、注册、登录、优惠券等场景使用专用模型,共享同一个特征平台
- ❌ Trying to build a single unified model for all scenarios → ✅ Use scenario-specific models (payment, registration, login, coupon) with a shared feature platform
延伸追问:
- 特征平台如何保证离线/在线特征一致性?
- 模型 A/B 实验如何在百万 TPS 下安全执行?
- How does the feature platform ensure consistency between offline and online features?
- How do you safely run model A/B experiments under million-level TPS?
- How would you handle a sudden 10x traffic spike during a promotion event?
风控关联:
- 这是 实时风控引擎 的核心设计题
- This is a core design question for a real-time risk decision engine
- 与 分布式系统 的限流/熔断/降级知识直接关联
- It is directly connected to rate limiting, circuit breaking, and degradation concepts in distributed systems
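Both sliding-window features and sliding-window rate limiting rest on the same primitive: counting events per key inside a moving time window. The class below is an in-memory sketch of that primitive, assuming timestamps arrive in non-decreasing order per key. `SlidingWindowCounter` is a hypothetical name, and a real deployment would keep this state in a shared store such as Redis rather than in process memory.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count events per key within the last `window_seconds` (in-memory sketch)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def hit(self, key, now):
        """Record one event at time `now` and return the count inside the window."""
        q = self.events[key]
        q.append(now)
        # Evict timestamps that fell out of the window (now - window, now]
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)

    def allow(self, key, now, limit):
        """Rate-limit check; note that rejected attempts still count as events."""
        return self.hit(key, now) <= limit
```

The value returned by `hit` can be used directly as a feature such as "transactions in the last 60 seconds", while `allow` turns the same count into a limiter.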
English Answer:
- A million-TPS fraud scoring system must be tiered. It should not send every transaction through expensive deep models. The first layer is the rule layer, usually within 0-50ms. It handles blacklists, whitelists, frequency controls, and simple rule matching, and should decide most obvious pass or reject traffic with very low latency.
- The second layer is the lightweight ML layer, usually within 50-200ms. XGBoost or LightGBM models use precomputed features plus real-time features to score gray-area traffic that rules cannot confidently decide. The third layer is the deep analysis layer, which runs asynchronously and may take 200ms to several seconds. It includes graph analytics, deep models, and sequence models, and feeds results back into later decisions or post-event enforcement.
- The feature service must be extremely fast. The Feature Store combines offline features from T+1 Spark jobs, near-real-time features from Flink five-minute windows, and real-time features from Redis sliding windows. For predictable high-value users or hot merchants, features can be preloaded into local cache before transactions to reduce online lookup latency.
- Traffic control and degradation are mandatory. Sentinel can provide rate limiting and circuit breaking, while Apollo or Nacos can dynamically turn non-core rules and models on or off. During major promotions, the system can disable non-critical ML models and keep only high-confidence rules on the hot path.
- Capacity planning must be quantitative. One million TPS means roughly one million scoring requests per second. If one rule-engine node handles 100K QPS, at least ten nodes are needed before redundancy; in practice, more are required for failover and traffic spikes. Redis Cluster must keep feature lookup below about 1ms and scale horizontally. The system should be verified by load tests at P50, P95, P99, and failure-mode scenarios.
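The capacity arithmetic above can be written down as a small helper. The 30% headroom, the two spare nodes, and the `lookups_per_request` parameter are illustrative assumptions, not figures from the original answer; real values should come from load tests.

```python
import math


def plan_capacity(peak_tps, per_node_qps, headroom=0.3, spare_nodes=2,
                  lookups_per_request=1):
    """Estimate rule-engine node count and feature-store load for a peak TPS.

    headroom: fraction of each node's benchmarked throughput left unused.
    spare_nodes: extra nodes reserved for failover (assumed value).
    """
    usable_qps = per_node_qps * (1.0 - headroom)
    base_nodes = math.ceil(peak_tps / usable_qps)
    return {
        "base_nodes": base_nodes,
        "total_nodes": base_nodes + spare_nodes,
        "redis_ops_per_sec": peak_tps * lookups_per_request,
    }
```

With no headroom and no spares, 1,000,000 TPS at 100,000 QPS per node gives the 10 nodes quoted above; adding 30% headroom and two spares raises the estimate to 17.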
关联
面经来源:FinalRound AI、InterviewPrep、Glassdoor