forked from Snailclimb/JavaGuide
Commit
1 parent 8759967, commit e37a10c
Showing 6 changed files with 65 additions and 1 deletion.
Binary file added (+11.4 KB): docs/system-design/security/images/sentive-words-filter/brower-trie.png
Binary file added (+12.8 KB): .../system-design/security/images/sentive-words-filter/common-collections-trie.png
docs/system-design/security/images/sentive-words-filter/trie.drawio
1 change: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
<mxfile host="Electron" modified="2022-01-13T05:13:27.300Z" agent="5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/13.4.5 Chrome/83.0.4103.122 Electron/9.1.0 Safari/537.36" etag="sbqmP3PZO11fwB39pEgg" version="13.4.5" type="device"><diagram id="F1ZqUBVQB0ptfqIP5Zs4" name="Page-1">3Vrfc6M2EP5reLQHIaEfj9hO0rmZu16b5to8dQjINhds5QiO7f71FSBsJHDO5Q5DPX4wWkkrtN+n3ZWQBaer3V3ivyw/ipDHlmOHOwvOLMcBDEL5l0n2hYQxUggWSRSqRkfBffQPV0JbSTdRyF+1hqkQcRq96MJArNc8SDWZnyRiqzebi1gf9cVf8JrgPvDjuvTPKEyXhZQ65Cj/hUeLZTkywKyoWfllYzWT16Ufim1FBG8sOE2ESIun1W7K48x4pV2Kfrcnag8vlvB1ek6H4I+QPt6Jr796n8JPf29Hd9uH9UhpefPjjZrw79kLFS+c7ksrSE3S4LIw2S6jlN+/+EFWs5WYS9kyXcWyBOTjXKxTBWI2w8k8iuOpiEWS64FzGvAgkPLXNBHPvFLzRF3k2llNaaa88MzTYKl01ydcvj1PUr6riJQB7rhY8TTZyyaq1mGo6KLYeKDZ9ogtKAFbVnClSuYrOi0Oqo8Wlw/K6P8BAKcGgHXDLG9iMdoZCqHP6bwRBRxQ/jTvGgWAbQ0FiPtGATaggKwJtdj0alFA9tBQQA0o4AwF6l4tClQHAbG+QXAbQZDrgLKrBWHk6Chg1DcKuIaCd7XWN8MBhn1bn9Ws/6Uz63MQupw0WZ9hAn3ceUoEdesT0rf1y5xMc0HE8myLgquFYQQMH0Ro7zjUNwd5VuRZ3hVnRUiHwbV7h6EhOa1Zfx162WZXltZizd+3uDROsv9L2S8vPKqWeWG2q1bN9mVpF6V5p7GrSo+VmmOnrFD2aYHUq9gkAX/HHGryqZ8sePqe2Yp2PNT293XcK7i6DbCWsoTHfhq96acCTVirET6LSM7sQCt4KtkuVRTzVr2qu3hTkWsoAoaiwjA1RTn1DtP+ATY2JOm9shEBcjYfs8JnnkTSCDzpkKPoTI46g+KokYahthQFwFB0aYo2bGF+jKIDcX3umbRCg6KVubcy2XAurYgRmE16ds2q+pasD8fXglff5Ys7KL6MDD9ECG1JmLHjUkBtQGXUBAhSXa3NxoxhwBiFSNLysmQiXZEJtIyiEFZd29gGcBCRFP8/XZ6RpGGnbbZnpI0uubDToz+Zp11whJ3JETwojpiJPGmbbpnn+OTS6Vb9tKxfX9ZvkgbPZOOw9qeukV1B09Gcy0ZspHvQdH0ds7Ecrmc2dsEsMDDKQDh2iR6eHDAmTks3hvN8DNmYEgCYQ6CmGiGZlNhIejcHESTTNnhZWjWdhRKLQsub1fl1JWfS7uA+i5WLuYNN2Jgip7rC7TGTO4a2qzxvZybI7Ze+07z0zwkzPYeVmo/AuL2PcG3dRzDd+1DDRVzWQzTdIXGzqwveNP9sMrUYySW31kQ+3OYS74P/5t9ld9wsB8fS+JOnRKMz/rbJLojlxB295sz1ZAOAX3Y5o8p6+bTI/7OhZtn3+htqyXUoxy+HKoaRbyB9p1e/USGdQqovFt3tqMVU9VFK5MfRYi2LgeR1RvZJ5mKiwI89VbGKwjA+5Q8TsVmHPFRrJPafeDzxg+dFLjcHPy5fgFW54hYn+S/XmUraimzsEezmq6l+kcxtukhGGxaWmZy3cY3829flbfzw8eHmt3jEVx8E2oUNN/lysuGMbAMOUDWTNwBzEgVzD3XRANWIws+OT+9EmBb7nRNB5DsR6TTd+goryExMYOuks6vzFVk8Xu0t
mh8vSMObfwE=</diagram></mxfile> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
Our system needs to filter sensitive words out of user-submitted text, for example words related to pornography, politics, or violence. | ||
|
||
The two approaches most commonly used for sensitive-word filtering are the **Trie algorithm** and the **DFA algorithm**. | ||
|
||
## Trie | ||
|
||
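A **Trie**, also called a dictionary tree or word-lookup tree, is a variant of the hash tree. It is typically used for string matching, where the problem is to quickly find a given string in a set of strings. The keyword suggestions shown while you type in a browser search box are usually built on a Trie. | ||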
|
||
![](./images/sentive-words-filter/brower-trie.png) | ||
|
||
Suppose our sensitive-word list contains the following entries: | ||
|
||
- 高清有码 | ||
- 高清AV | ||
- 东京冷 | ||
- 东京热 | ||
|
||
The sensitive-word Trie built from them looks like this: | ||
|
||
![](./images/sentive-words-filter/trie.png) | ||
|
||
|
||
|
||
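To look up the string “东京热”, we split it into the single characters “东”, “京”, and “热”, then match them one at a time, starting from the root node of the Trie and following one child edge per character. | ||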
|
||
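As this shows, **the core idea of a Trie is simple: strings that share a prefix share the same path, which makes string matching more efficient.** | ||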
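The lookup described above can be sketched as a small character-level Trie. This is a minimal illustration rather than the implementation used by any particular library: the class name `SimpleTrie`, its method names, and the ASCII stand-in words are all assumptions made for the example. Because it works one `char` at a time, inserting the four words listed above and calling `contains("东京热")` would walk 东, 京, 热 in the same way.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal character-level Trie; class and method names are illustrative. */
class SimpleTrie {
    /** One node per character; children are keyed by the next character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, reusing the nodes of any prefix that is already stored. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    /** Returns true only if this exact word was inserted. */
    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.endOfWord;
    }

    /** Returns true if at least one stored word starts with the prefix. */
    boolean startsWith(String prefix) {
        return walk(prefix) != null;
    }

    /** Follows the characters from the root; null means the path breaks off. */
    private Node walk(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        // Hypothetical stand-in words: two pairs that share a prefix.
        for (String w : new String[] {"hot", "hop", "card", "care"}) {
            trie.insert(w);
        }
        System.out.println(trie.contains("hot"));    // an inserted word
        System.out.println(trie.contains("ho"));     // only a prefix of stored words
        System.out.println(trie.startsWith("car"));  // prefix shared by two words
    }
}
```

A `HashMap` per node keeps the sketch short and handles any character set. A fixed-size array per node is a common alternative when the alphabet is small.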
|
||
The [Apache Commons Collections](https://mvnrepository.com/artifact/org.apache.commons/commons-collections4) library ships a Trie implementation: | ||
|
||
![](./images/sentive-words-filter/common-collections-trie.png) | ||
|
||
```java | ||
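// Assumes commons-collections4 and JUnit 5 on the classpath; the statements belong inside a test method. | ||
import org.apache.commons.collections4.Trie; | ||
import org.apache.commons.collections4.trie.PatriciaTrie; | ||
import static org.junit.jupiter.api.Assertions.assertEquals; | ||
import static org.junit.jupiter.api.Assertions.assertTrue; | ||
// Keys are the strings to index; values are arbitrary payloads. | ||
Trie<String, String> trie = new PatriciaTrie<>(); | ||
trie.put("Abigail", "student"); | ||
trie.put("Abi", "doctor"); | ||
trie.put("Annabel", "teacher"); | ||
trie.put("Christina", "student"); | ||
trie.put("Chris", "doctor"); | ||
assertTrue(trie.containsKey("Abigail")); | ||
// prefixMap returns a sorted view of every entry whose key starts with the prefix. | ||
assertEquals("{Abi=doctor, Abigail=student}", trie.prefixMap("Abi").toString()); | ||
assertEquals("{Chris=doctor, Christina=student}", trie.prefixMap("Chr").toString()); | ||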
``` | ||
|
||
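The Aho-Corasick (AC) automaton is an improved algorithm built on top of the Trie. It is a multi-pattern matching algorithm invented by Alfred V. Aho and Margaret J. Corasick, researchers at Bell Labs. | ||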
|
||
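The AC automaton uses a Trie to store the prefixes of the pattern strings, and uses failure pointers (also called fail links) to decide where to jump when a match fails. | ||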
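To make the role of the failure pointers concrete, here is a compact sketch of an AC automaton. It follows the textbook construction (build the Trie, then set failure pointers level by level with a breadth-first traversal). The class name `AcAutomaton`, the `findAll` method, and the `pattern@index` result format are assumptions chosen for the example, and empty patterns are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Sketch of an Aho-Corasick automaton; names are illustrative. */
class AcAutomaton {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail; // where to continue when no child matches
        // Patterns ending here, including those inherited through fail links.
        final List<String> outputs = new ArrayList<>();
    }

    private final Node root = new Node();

    AcAutomaton(List<String> patterns) {
        // Step 1: build an ordinary Trie of all patterns.
        for (String p : patterns) {
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(p);
        }
        // Step 2: set failure pointers breadth-first, so that shallower
        // nodes are finished before the deeper nodes that depend on them.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node parent = queue.remove();
            for (Map.Entry<Character, Node> e : parent.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = parent.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and reports each hit as "pattern@startIndex". */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // On a mismatch, follow failure pointers instead of restarting.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String p : node.outputs) {
                hits.add(p + "@" + (i - p.length() + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AcAutomaton ac = new AcAutomaton(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(ac.findAll("ushers"));
    }
}
```

The point of the failure pointers is that the text is read only once: after a mismatch the automaton falls back to the longest suffix of what it has matched that is still a prefix of some pattern, so overlapping words such as `she` and `he` are both reported.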
|
||
Related reading: [地铁十分钟 | AC自动机 (Ten Minutes on the Subway | The AC Automaton)](https://zhuanlan.zhihu.com/p/146369212) | ||
|
||
## DFA | ||
|
||
DFA stands for Deterministic Finite Automaton. | ||
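The text above does not describe how a DFA-based filter is built, so the following is only a sketch of one common arrangement. Each state maps the next character to exactly one successor state (which is what makes the automaton deterministic), and accepting states mark the end of a sensitive word. The class name `DfaWordFilter`, the `mask` method, the longest-match policy, and the `*` replacement character are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a DFA-style sensitive-word filter; names are illustrative. */
class DfaWordFilter {
    private static final class State {
        // At most one successor per character: the transition is deterministic.
        final Map<Character, State> transitions = new HashMap<>();
        boolean accepting; // true if a sensitive word ends in this state
    }

    private final State start = new State();

    DfaWordFilter(List<String> words) {
        for (String w : words) {
            State s = start;
            for (char c : w.toCharArray()) {
                s = s.transitions.computeIfAbsent(c, k -> new State());
            }
            s.accepting = true;
        }
    }

    /** Length of the longest sensitive word starting at index from, or 0. */
    private int longestMatch(String text, int from) {
        State s = start;
        int matched = 0;
        for (int i = from; i < text.length(); i++) {
            s = s.transitions.get(text.charAt(i));
            if (s == null) {
                break; // no transition for this character: stop scanning
            }
            if (s.accepting) {
                matched = i - from + 1;
            }
        }
        return matched;
    }

    /** Replaces every character of each matched word with '*'. */
    String mask(String text) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len == 0) {
                i++;
                continue;
            }
            for (int j = i; j < i + len; j++) {
                out.setCharAt(j, '*');
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical word list; real lists would be loaded from storage.
        DfaWordFilter filter = new DfaWordFilter(Arrays.asList("bad", "badge"));
        System.out.println(filter.mask("a bad badge!"));
    }
}
```

Structurally this is the same tree of characters as a Trie, read as a state machine. Unlike the AC automaton it restarts from the start state at every text position, which is simpler but can rescan characters.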
|
||
|
||
|