Using Encoding Tools

Feishu user 4443

Modified December 8, 2023

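This article walks through how a sentence is encoded and how to use the encoding tools (tokenizers) that ship with the HuggingFace Transformers library, including basic encoding with encode(), enhanced encoding with encode_plus(), and batch encoding with batch_encode_plus().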

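I. An example of encoding a sentence

Suppose we want to encode the sentence 'the quick brown fox jumps over a lazy dog'. How should we go about it? Put simply, encoding means representing each word with a number and using special symbols to mark the beginning and end of the sentence. In the code below, vocab is an example vocabulary. After the special symbols <SOS> and <EOS> are added to the beginning and end of the sentence, each word can be looked up in vocab to get its number: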

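Code block

def encode_example_test():
    # Vocabulary: maps each token to an integer id
    vocab = {
        '<SOS>': 0,
        '<EOS>': 1,
        'the': 2,
        'quick': 3,
        'brown': 4,
        'fox': 5,
        'jumps': 6,
        'over': 7,
        'a': 8,
        'lazy': 9,
        'dog': 10,
    }

    # Preprocessing: wrap the sentence in start and end symbols
    sent = 'the quick brown fox jumps over a lazy dog'
    sent = '<SOS> ' + sent + ' <EOS>'
    print(sent)

    # Tokenize English text by splitting on whitespace
    words = sent.split()
    print(words)

    # Encode: look up each token's id (raises KeyError for unknown words)
    encode = [vocab[word] for word in words]
    print(encode)  # [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1]


encode_example_test()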

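As the example shows, the encoding workflow consists of four steps: defining the vocabulary, preprocessing the sentence, tokenizing, and mapping the tokens to numbers.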

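The four steps can also be written as small reusable functions. The following is only a minimal sketch: the helper names build_vocab, encode and decode are hypothetical, and the <UNK> symbol for out-of-vocabulary words is an assumption added here so that unseen words do not raise a KeyError. Because <UNK> takes id 2, the word ids differ from the example above.

```python
def build_vocab(sentences):
    """Step 1: build a vocabulary from whitespace-separated sentences."""
    # <UNK> is an assumption of this sketch, not part of the original example.
    vocab = {'<SOS>': 0, '<EOS>': 1, '<UNK>': 2}
    for sentence in sentences:
        for word in sentence.split():
            # Assign the next free id to each new word.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Steps 2-4: add special symbols, tokenize, and map tokens to ids."""
    words = ['<SOS>'] + sentence.split() + ['<EOS>']
    return [vocab.get(word, vocab['<UNK>']) for word in words]


def decode(ids, vocab):
    """Inverse mapping: turn ids back into a space-joined string."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return ' '.join(id_to_word[idx] for idx in ids)


vocab = build_vocab(['the quick brown fox jumps over a lazy dog'])
ids = encode('the quick red fox', vocab)
print(ids)                 # [0, 3, 4, 2, 6, 1]
print(decode(ids, vocab))  # <SOS> the quick <UNK> fox <EOS>
```

The word 'red' is not in the vocabulary, so it is mapped to the id of <UNK> instead of causing an error.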

II. Using the encoding tools

Next, we look at the encoding tools provided by HuggingFace.

1. The basic encoding function encode()

Code block

def encode_test():
    # Chapter 2 / Load the encoding tool
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained(
        pretrained_model_name_or_path='bert-base-chinese',  # the tokenizer usually shares its name with the model
        cache_dir=None,  # cache directory for the tokenizer files
        force_download=False,  # if True, always download, even when a local cache exists
    )

    # Chapter 2 / Prepare the experiment data
    sents = [
        '你站在桥上看风景',
        '看风景的人在楼上看你',
        '明月装饰了你的窗子',
        '你装饰了别人的梦',
    ]

    # Chapter 2 / The basic encoding function
    out = tokenizer.encode(
        text=sents[0],
        text_pair=sents[1],  # set text_pair=None to encode a single sentence
        truncation=True,  # truncate when the sequence is longer than max_length
        padding='max_length',  # always pad with PAD up to max_length
        add_special_tokens=True,  # add the special symbols to the sequence
        max_length=25,  # maximum length
        return_tensors=None,  # return a plain list; 'tf', 'pt' or 'np' return TensorFlow, PyTorch or NumPy data instead
    )
    print(out)
    print(tokenizer.decode(out))


encode_test()
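To show what text_pair, truncation, padding and max_length do to the layout of the result, here is a pure-Python sketch that needs no model download. It is not the actual BertTokenizer implementation: the function name toy_encode_pair is hypothetical, it returns token strings rather than ids, and it assumes one token per character, which roughly matches how bert-base-chinese splits Chinese text. Truncation here drops tokens from the end of the longer sentence first.

```python
def toy_encode_pair(text, text_pair=None, max_length=25):
    """Toy illustration of the [CLS] A [SEP] B [SEP] [PAD]... layout."""
    # Assumption: one token per character.
    tokens_a = list(text)
    tokens_b = list(text_pair) if text_pair else []
    n_special = 3 if tokens_b else 2  # [CLS] plus one [SEP] per sentence

    # Truncation: shorten the longer sentence until everything fits.
    while (len(tokens_a) + len(tokens_b) + n_special > max_length
           and (tokens_a or tokens_b)):
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()

    # Special symbols: [CLS] at the start, [SEP] after each sentence.
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    if tokens_b:
        tokens += tokens_b + ['[SEP]']

    # Padding: fill up to max_length with [PAD].
    tokens += ['[PAD]'] * (max_length - len(tokens))
    return tokens


tokens = toy_encode_pair('你站在桥上看风景', '看风景的人在楼上看你')
print(len(tokens))       # 25
print(' '.join(tokens))
```

With these two sentences the sequence holds 21 tokens before padding (1 + 8 + 1 + 10 + 1), so four [PAD] tokens are appended to reach max_length=25.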
