[docs add] Sensitive word filtering
Snailclimb committed Jan 13, 2022
1 parent 8759967 commit e37a10c
Showing 6 changed files with 65 additions and 1 deletion.
10 changes: 9 additions & 1 deletion README.md
@@ -229,6 +229,12 @@ The JVM section mainly follows the [JVM Specification - Java 8 ](https://docs.oracle

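Data masking means transforming sensitive data according to specific rules, for example replacing certain digits of a phone number or ID card number with `*`.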
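As a minimal sketch of the masking rule described above, the following Java helper keeps a few leading and trailing characters and replaces the rest with `*`. The class name `MaskingSketch`, the method `mask`, and the choice of keeping 3 leading and 4 trailing digits are illustrative assumptions, not part of the original guide.

```java
final class MaskingSketch {
    /**
     * Keeps the first keepHead and last keepTail characters of value
     * and replaces everything in between with '*'.
     */
    static String mask(String value, int keepHead, int keepTail) {
        if (value == null || value.length() <= keepHead + keepTail) {
            return value; // too short to mask; returned unchanged in this sketch
        }
        StringBuilder sb = new StringBuilder(value.length());
        sb.append(value, 0, keepHead);
        for (int i = keepHead; i < value.length() - keepTail; i++) {
            sb.append('*');
        }
        sb.append(value, value.length() - keepTail, value.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical 11-digit phone number used only for illustration.
        System.out.println(mask("13812345678", 3, 4)); // prints 138****5678
    }
}
```

Which characters to keep depends on the field; a real system would usually define one rule per data type.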

#### Sensitive word filtering



https://github.com/toolgood/ToolGood.Words

### Scheduled tasks

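A friend recently asked about scheduled tasks, so I wrote a short article summarizing the core concepts and the common technology choices: [Java Scheduled Tasks Demystified](./docs/system-design/定时任务.md)
@@ -353,7 +359,9 @@ Dubbo is a Chinese-developed RPC framework open-sourced by Alibaba. Related reading: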

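**Once a user's request has gone without a response for longer than a set time, end the request and throw an exception.** Without a timeout, responses can become slow, and requests can even pile up until the system is unable to process any more of them.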

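In addition, the retry count is usually set to 3. More retries bring no benefit and only add load on the server (a retry-on-failure mechanism is not suitable for some scenarios).
The retry count is usually set to 3. More retries bring no benefit and only add load on the server (a retry-on-failure mechanism is not suitable for some scenarios). After a failed attempt, a delay is normally added before the next retry, and a randomized delay is generally recommended.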

In addition, to better protect downstream services, we can combine retries with a circuit breaker.
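The retry advice above (a bounded number of attempts with a randomized delay in between) can be sketched roughly as follows. The class `RetrySketch`, the method `callWithRetry`, and the jitter range are assumptions chosen for illustration. Here `maxAttempts` counts the first call as well, and no circuit breaker is included.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class RetrySketch {
    /**
     * Calls the task up to maxAttempts times. Between failed attempts it sleeps
     * for a random delay between baseDelayMs and 2 * baseDelayMs (inclusive).
     * Rethrows the last exception if every attempt fails.
     */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long baseDelayMs) throws Exception {
        if (maxAttempts < 1 || baseDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and baseDelayMs >= 0");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Random jitter spreads out retries from many clients.
                    long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs + 1);
                    Thread.sleep(baseDelayMs + jitter);
                }
            }
        }
        throw last;
    }
}
```

A call such as `RetrySketch.callWithRetry(() -> client.fetch(), 3, 200)` (with `client.fetch()` standing in for any remote call) would try at most three times. Retrying is only safe when the operation is idempotent, which is one reason it does not fit every scenario.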

### Disaster recovery design and multi-site active-active

@@ -0,0 +1 @@
<mxfile host="Electron" modified="2022-01-13T05:13:27.300Z" agent="5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/13.4.5 Chrome/83.0.4103.122 Electron/9.1.0 Safari/537.36" etag="sbqmP3PZO11fwB39pEgg" version="13.4.5" type="device"><diagram id="F1ZqUBVQB0ptfqIP5Zs4" name="Page-1">3Vrfc6M2EP5reLQHIaEfj9hO0rmZu16b5to8dQjINhds5QiO7f71FSBsJHDO5Q5DPX4wWkkrtN+n3ZWQBaer3V3ivyw/ipDHlmOHOwvOLMcBDEL5l0n2hYQxUggWSRSqRkfBffQPV0JbSTdRyF+1hqkQcRq96MJArNc8SDWZnyRiqzebi1gf9cVf8JrgPvDjuvTPKEyXhZQ65Cj/hUeLZTkywKyoWfllYzWT16Ufim1FBG8sOE2ESIun1W7K48x4pV2Kfrcnag8vlvB1ek6H4I+QPt6Jr796n8JPf29Hd9uH9UhpefPjjZrw79kLFS+c7ksrSE3S4LIw2S6jlN+/+EFWs5WYS9kyXcWyBOTjXKxTBWI2w8k8iuOpiEWS64FzGvAgkPLXNBHPvFLzRF3k2llNaaa88MzTYKl01ydcvj1PUr6riJQB7rhY8TTZyyaq1mGo6KLYeKDZ9ogtKAFbVnClSuYrOi0Oqo8Wlw/K6P8BAKcGgHXDLG9iMdoZCqHP6bwRBRxQ/jTvGgWAbQ0FiPtGATaggKwJtdj0alFA9tBQQA0o4AwF6l4tClQHAbG+QXAbQZDrgLKrBWHk6Chg1DcKuIaCd7XWN8MBhn1bn9Ws/6Uz63MQupw0WZ9hAn3ceUoEdesT0rf1y5xMc0HE8myLgquFYQQMH0Ro7zjUNwd5VuRZ3hVnRUiHwbV7h6EhOa1Zfx162WZXltZizd+3uDROsv9L2S8vPKqWeWG2q1bN9mVpF6V5p7GrSo+VmmOnrFD2aYHUq9gkAX/HHGryqZ8sePqe2Yp2PNT293XcK7i6DbCWsoTHfhq96acCTVirET6LSM7sQCt4KtkuVRTzVr2qu3hTkWsoAoaiwjA1RTn1DtP+ATY2JOm9shEBcjYfs8JnnkTSCDzpkKPoTI46g+KokYahthQFwFB0aYo2bGF+jKIDcX3umbRCg6KVubcy2XAurYgRmE16ds2q+pasD8fXglff5Ys7KL6MDD9ECG1JmLHjUkBtQGXUBAhSXa3NxoxhwBiFSNLysmQiXZEJtIyiEFZd29gGcBCRFP8/XZ6RpGGnbbZnpI0uubDToz+Zp11whJ3JETwojpiJPGmbbpnn+OTS6Vb9tKxfX9ZvkgbPZOOw9qeukV1B09Gcy0ZspHvQdH0ds7Ecrmc2dsEsMDDKQDh2iR6eHDAmTks3hvN8DNmYEgCYQ6CmGiGZlNhIejcHESTTNnhZWjWdhRKLQsub1fl1JWfS7uA+i5WLuYNN2Jgip7rC7TGTO4a2qzxvZybI7Ze+07z0zwkzPYeVmo/AuL2PcG3dRzDd+1DDRVzWQzTdIXGzqwveNP9sMrUYySW31kQ+3OYS74P/5t9ld9wsB8fS+JOnRKMz/rbJLojlxB295sz1ZAOAX3Y5o8p6+bTI/7OhZtn3+htqyXUoxy+HKoaRbyB9p1e/USGdQqovFt3tqMVU9VFK5MfRYi2LgeR1RvZJ5mKiwI89VbGKwjA+5Q8TsVmHPFRrJPafeDzxg+dFLjcHPy5fgFW54hYn+S/XmUraimzsEezmq6l+kcxtukhGGxaWmZy3cY3829flbfzw8eHmt3jEVx8E2oUNN/lysuGMbAMOUDWTNwBzEgVzD3XRANWIws+OT+9EmBb7nRNB5DsR6TTd+goryExMYOuks6vzFVk8Xu0t
mh8vSMObfwE=</diagram></mxfile>
55 changes: 55 additions & 0 deletions docs/system-design/security/sentive-words-filter.md
@@ -0,0 +1,55 @@
Our system needs to filter sensitive words out of user-submitted text, such as terms related to pornography, politics, or violence.

The two algorithms most commonly used for sensitive word filtering are the **Trie algorithm** and the **DFA algorithm**.

## Trie

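A **Trie**, also called a dictionary tree or word lookup tree, is a variant of the hash tree. It is typically used for string matching, to quickly find a given string within a set of strings. The keyword suggestions in a browser's search box, for example, are generally built on a Trie.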

![](./images/sentive-words-filter/brower-trie.png)

Suppose our sensitive word list contains the following words:

- 高清有码
- 高清AV
- 东京冷
- 东京热

The sensitive-word Trie we build from them looks like this:

![](./images/sentive-words-filter/trie.png)



To look up the string "东京热", we split it into the individual characters "东", "京", and "热", then match them one by one starting from the root of the Trie.

As you can see, **the core idea of a Trie is simple: it uses shared prefixes to make string matching more efficient.**
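The lookup described above can be sketched with a small hand-written Trie. This is a minimal illustration rather than a production filter: the class `TrieSketch` and its method names are assumptions, and it works on `char` units, which is sufficient for the example words.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if a complete word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, reusing nodes for any prefix already in the Trie. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    /** True only if the exact word was inserted. */
    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.endOfWord;
    }

    /** True if any inserted word starts with the given prefix. */
    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    // Follows the characters of s from the root; returns null on the first mismatch.
    private Node walk(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

After inserting the four example words, `contains("东京热")` is true, while `contains("东京")` is false because "东京" is only a shared prefix, for which `hasPrefix("东京")` is true.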

The [Apache Commons Collections](https://mvnrepository.com/artifact/org.apache.commons/commons-collections4) library includes a Trie implementation:

![](./images/sentive-words-filter/common-collections-trie.png)

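```java
import org.apache.commons.collections4.Trie;
import org.apache.commons.collections4.trie.PatriciaTrie;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

// PatriciaTrie keeps its keys sorted, so the prefixMap output order is stable.
Trie<String, String> trie = new PatriciaTrie<>();
trie.put("Abigail", "student");
trie.put("Abi", "doctor");
trie.put("Annabel", "teacher");
trie.put("Christina", "student");
trie.put("Chris", "doctor");
assertTrue(trie.containsKey("Abigail"));
// prefixMap returns every entry whose key starts with the given prefix.
assertEquals("{Abi=doctor, Abigail=student}", trie.prefixMap("Abi").toString());
assertEquals("{Chris=doctor, Christina=student}", trie.prefixMap("Chr").toString());
```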

The Aho-Corasick (AC) automaton is an improved algorithm built on top of the Trie. It is a multi-pattern matching algorithm invented by Bell Labs researchers Alfred V. Aho and Margaret J. Corasick.

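The AC automaton stores the prefixes of the pattern strings in a Trie and uses failure pointers (fail links) to decide where to jump when a match fails.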
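To make the role of the failure pointers concrete, here is a compact sketch of an Aho-Corasick matcher. The class `AhoCorasickSketch`, its field names, and the `"pattern@startIndex"` result format are illustrative assumptions. Patterns are assumed to be non-empty.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // node for the longest proper suffix that is also a Trie path
        final List<String> outputs = new ArrayList<>();  // patterns that end at this node
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // Step 1: build the Trie of all patterns.
        for (String p : patterns) {
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(p);
        }
        // Step 2: set failure pointers breadth-first, so shallower nodes are done first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                // Inherit patterns that end here as suffixes of the current path.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns each hit as "pattern@startIndex". */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // On a mismatch, follow failure pointers instead of restarting from the root.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String p : node.outputs) {
                hits.add(p + "@" + (i - p.length() + 1));
            }
        }
        return hits;
    }
}
```

With the patterns `he`, `she`, `his`, and `hers`, searching the text `ushers` yields `she@1`, `he@2`, and `hers@2` in a single pass over the text, which is the advantage over probing the Trie separately from every start position.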

Related reading: [Ten Minutes on the Subway | AC Automaton](https://zhuanlan.zhihu.com/p/146369212)

## DFA

DFA stands for Deterministic Finite Automaton.
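One common way to apply a DFA to sensitive word filtering is to treat each node of the word tree as a state and each character as an input that moves to exactly one next state. The sketch below is an assumption about how such a filter might look, not the approach this document settles on. `DfaFilterSketch` and its members are hypothetical names, and the transition function is partial: a missing transition means no word continues from that state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DfaFilterSketch {
    // State 0 is the start state; transitions.get(s) maps an input character to the next state.
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();
    // States in which a complete sensitive word has just been read.
    private final Set<Integer> accepting = new HashSet<>();

    DfaFilterSketch(List<String> words) {
        transitions.add(new HashMap<>());
        for (String w : words) {
            int state = 0;
            for (char c : w.toCharArray()) {
                Integer next = transitions.get(state).get(c);
                if (next == null) {
                    next = transitions.size();      // allocate a new state
                    transitions.add(new HashMap<>());
                    transitions.get(state).put(c, next);
                }
                state = next;
            }
            accepting.add(state);
        }
    }

    /** Replaces the longest sensitive word found at each position with '*' characters. */
    String filter(String text) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int state = 0;
            int matchEnd = -1;
            for (int j = i; j < text.length(); j++) {
                Integer next = transitions.get(state).get(text.charAt(j));
                if (next == null) {
                    break;                          // no transition: stop scanning from i
                }
                state = next;
                if (accepting.contains(state)) {
                    matchEnd = j;                   // remember the longest match so far
                }
            }
            if (matchEnd >= 0) {
                for (int k = i; k <= matchEnd; k++) {
                    out.setCharAt(k, '*');
                }
                i = matchEnd + 1;
            } else {
                i++;
            }
        }
        return out.toString();
    }
}
```

Because every state has at most one successor per character, each step is a single map lookup. This version restarts from the next position after a failed scan, so it does not have the single-pass guarantee of an AC automaton.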


