Custom Graph Component: 1.2 - Implementation Details of Other Tokenizers

Feishu user 4443

Last modified December 5, 2023

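This article walks through how several Tokenizers are implemented in Rasa, covering both the default Tokenizers and third-party ones. The default Tokenizers are JiebaTokenizer, MitieTokenizer, SpacyTokenizer and WhitespaceTokenizer; the third-party ones are BertTokenizer and AnotherWhitespaceTokenizer.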

1. JiebaTokenizer

The overall code structure of the JiebaTokenizer class is shown below:

[Image: overall code structure of the JiebaTokenizer class]

The code that loads custom dictionaries is shown below [3]:

Code block

# Module-level context assumed by this excerpt: import glob, import logging,
# from typing import Text, and logger = logging.getLogger(__name__).
@staticmethod
def _load_custom_dictionary(path: Text) -> None:
    """Load all the custom dictionaries stored in the path.

    More information about the dictionaries file format can be found in the
    documentation of jieba. https://github.com/fxsjy/jieba#load-dictionary
    """
    print("JiebaTokenizer._load_custom_dictionary()")  # Debug trace added for this walkthrough.
    import jieba

    jieba_userdicts = glob.glob(f"{path}/*")  # Collect every entry directly under path.
    for jieba_userdict in jieba_userdicts:  # Iterate over the matched files.
        logger.info(f"Loading Jieba User Dictionary at {jieba_userdict}")  # Log which dictionary is being loaded.
        jieba.load_userdict(jieba_userdict)  # Register the user dictionary with jieba.
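
As a rough illustration of the file discovery step above, the following sketch reproduces the glob pattern using only the standard library. The helper names `find_user_dictionaries` and `write_sample_dictionaries` and the sample file names are hypothetical; the real method passes each matched path to `jieba.load_userdict` instead of returning it. Note that `{path}/*` matches every non-hidden entry directly under the directory and does not recurse, so a subdirectory would be matched as well.

```python
import glob
import os
import tempfile
from typing import List


def find_user_dictionaries(path: str) -> List[str]:
    """Return the paths that the f"{path}/*" pattern matches (hypothetical helper)."""
    # Same pattern as in _load_custom_dictionary; sorted only to make the
    # result deterministic for this example.
    return sorted(glob.glob(f"{path}/*"))


def write_sample_dictionaries(directory: str) -> None:
    """Create two small files in jieba's user dictionary format."""
    # Each line is "word [frequency] [part-of-speech tag]"; the last two
    # fields are optional.
    samples = {
        "products.txt": "云计算 5 n\n",
        "names.txt": "李小福 2 nr\n",
    }
    for name, content in samples.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write(content)


with tempfile.TemporaryDirectory() as tmp_dir:
    write_sample_dictionaries(tmp_dir)
    dictionary_names = [os.path.basename(p) for p in find_user_dictionaries(tmp_dir)]

print(dictionary_names)  # ['names.txt', 'products.txt']
```

Keeping one dictionary per topic in a single directory, as sketched here, fits the loading loop well, since every file in that directory is picked up without further configuration.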

Tokenization itself is implemented in the tokenize() method, shown below:

Code block

def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    """Tokenizes the text of the provided attribute of the incoming message."""
    print("JiebaTokenizer.tokenize()")  # Debug trace added for this walkthrough.

    import jieba

    text = message.get(attribute)  # Read the text stored under the given attribute.

    tokenized = jieba.tokenize(text)  # Yields (word, start, end) triples.
    tokens = [Token(word, start) for (word, start, end) in tokenized]  # Build a Token from each word and its start offset.

    return self._apply_token_pattern(tokens)
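
To make the role of the start offset concrete, here is a minimal sketch that does not require jieba or Rasa. The triples are written by hand in the same `(word, start, end)` shape that `jieba.tokenize` yields; they are not actual jieba output. `SimpleToken` and `triples_to_tokens` are hypothetical stand-ins for Rasa's `Token` and for the list comprehension in `tokenize()`.

```python
from typing import List, NamedTuple, Tuple


class SimpleToken(NamedTuple):
    """Hypothetical stand-in for Rasa's Token, holding only text and offsets."""

    text: str
    start: int
    end: int


def triples_to_tokens(triples: List[Tuple[str, int, int]]) -> List[SimpleToken]:
    """Mirror the comprehension in tokenize(): keep word and start, derive end."""
    return [
        SimpleToken(word, start, start + len(word))
        for (word, start, _end) in triples
    ]


text = "我想订机票"
# Hand-written segmentation for illustration only; real jieba output may differ.
sample_triples = [("我", 0, 1), ("想", 1, 2), ("订", 2, 3), ("机票", 3, 5)]

tokens = triples_to_tokens(sample_triples)
for token in tokens:
    # Each token's offsets should slice its own text back out of the message.
    assert text[token.start:token.end] == token.text
    print(token)
```

The sketch drops the third element of each triple, just as the comprehension in `tokenize()` does, and recomputes the end offset from the start offset and the word length.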

self._apply_token_pattern(tokens) returns a List[Token]. The Token class is defined as follows:

Code block

class Token:
    """Used by `Tokenizers` which split a single message into multiple `Token`s."""

    def __init__(
        self,
        text: Text,
        start: int,
        end: Optional[int] = None,
        data: Optional[Dict[Text, Any]] = None,
        lemma: Optional[Text] = None,
    ) -> None:
        """Create a `Token`.

        Args:
            text: The token text.
            start: The start index of the token within the entire message.
            end: The end index of the token within the entire message.
            data: Additional token data.
            lemma: An optional lemmatized version of the token text.
        """
        self.text = text
        self.start = start
        self.end = end if end else start + len(text)  # Default: derive end from start and text length.
        self.data = data if data else {}  # Default: empty dict.
        self.lemma = lemma or text  # Default: the token text itself.
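
The defaults in the constructor can be checked with a short, self-contained sketch. The class below is a trimmed copy of the constructor shown above, renamed `MiniToken` (a hypothetical name) to make clear that it is not imported from Rasa; it exists only so that the example runs without Rasa installed.

```python
from typing import Any, Dict, Optional


class MiniToken:
    """Trimmed copy of the Token constructor above, for illustration only."""

    def __init__(
        self,
        text: str,
        start: int,
        end: Optional[int] = None,
        data: Optional[Dict[str, Any]] = None,
        lemma: Optional[str] = None,
    ) -> None:
        self.text = text
        self.start = start
        self.end = end if end else start + len(text)
        self.data = data if data else {}
        self.lemma = lemma or text


# Only text and start are given, so end, data and lemma fall back to defaults.
token = MiniToken("hello", 6)
print(token.end, token.data, token.lemma)  # 11 {} hello

# Explicit values override the defaults.
custom = MiniToken("running", 0, end=7, lemma="run")
print(custom.end, custom.lemma)  # 7 run
```

This is why JiebaTokenizer can construct each Token from just the word and its start offset: the end offset and the lemma are filled in by the constructor.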
