Common Regular Expressions

2023-05-04

1. Checking for Chinese characters

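def is_chinese(uchar):
    """Return True if a single Unicode character is a Chinese character (U+4E00..U+9FA5)."""
    return '\u4e00' <= uchar <= '\u9fa5'


def is_chinese_string(string):
    """Return True if every character in the string is a Chinese character."""
    return all(is_chinese(c) for c in string)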
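
The same check can also be written as a single regular expression. The sketch below is an alternative, not part of the original snippet; `CHINESE_ONLY` and `is_chinese_string_re` are hypothetical names. It reuses the U+4E00..U+9FA5 range, which covers the commonly used characters but leaves out the CJK extension blocks.

```python
import re

# Same range as is_chinese(); "+" requires at least one character.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fa5]+")


def is_chinese_string_re(string: str) -> bool:
    """Return True if the string is non-empty and every character is in U+4E00..U+9FA5."""
    return CHINESE_ONLY.fullmatch(string) is not None


print(is_chinese_string_re("正则表达式"))  # True
print(is_chinese_string_re("正则regex"))   # False
print(is_chinese_string_re(""))            # False
```

One difference to keep in mind: the `all(...)` version returns True for an empty string, while this pattern rejects it because of the `+` quantifier.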

2. Chinese, English, Korean, and Japanese characters

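Range	Description
\u4e00-\u9fa5	Unicode range for Chinese characters
\u0030-\u0039	Unicode range for digits (0-9)
\u0041-\u005a	Unicode range for uppercase letters (A-Z)
\u0061-\u007a	Unicode range for lowercase letters (a-z)
\uAC00-\uD7AF	Unicode range for Korean (Hangul syllables)
\u3040-\u31FF	Unicode range for Japanese (Hiragana and Katakana; the span also includes Bopomofo and Hangul compatibility jamo)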
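
As a rough illustration of how these ranges can be used, the sketch below turns each row of the table into a character class and reports which ones occur in a string. `SCRIPT_PATTERNS` and `scripts_in` are hypothetical names introduced for this example only.

```python
import re

# One character class per row of the table above.
SCRIPT_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fa5]"),
    "digit": re.compile(r"[\u0030-\u0039]"),
    "upper": re.compile(r"[\u0041-\u005a]"),
    "lower": re.compile(r"[\u0061-\u007a]"),
    "korean": re.compile(r"[\uAC00-\uD7AF]"),
    "japanese": re.compile(r"[\u3040-\u31FF]"),
}


def scripts_in(text: str) -> set:
    """Return the names of all ranges that match at least one character in text."""
    return {name for name, pattern in SCRIPT_PATTERNS.items() if pattern.search(text)}


print(sorted(scripts_in("Hello 世界 123")))  # ['chinese', 'digit', 'lower', 'upper']
print(sorted(scripts_in("こんにちは 안녕")))   # ['japanese', 'korean']
```

Kanji in Japanese text fall in the same range as Chinese characters, so a range-based check like this cannot tell the two languages apart.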

3. Filtering out characters other than Chinese characters, English letters, and digits

import re

def clean(x: str):
    # Remove every character that is not a Chinese character, a digit, or an ASCII letter.
    str_text = re.sub(r"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]", "", x)
    return str_text
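
A quick check of the filter on mixed input. This is a minimal sketch: the precompiled pattern (`NON_ALNUM_CHINESE`, a hypothetical name) is a small variation that avoids re-parsing the pattern on every call, and it is meant to behave the same as `clean` above.

```python
import re

# Negated character class: anything that is NOT Chinese, a digit, or an ASCII letter.
NON_ALNUM_CHINESE = re.compile(
    r"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]"
)


def clean(x: str) -> str:
    """Strip punctuation, whitespace, and any other character outside the kept ranges."""
    return NON_ALNUM_CHINESE.sub("", x)


print(clean("Hello, 世界! 2023-05-04"))  # Hello世界20230504
print(clean("你好，世界。"))              # 你好世界
```

Note that whitespace is removed as well, so word boundaries in English text are lost; add `\s` to the kept set if that matters.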

4. Extraction with lookbehind and lookahead

This technique is used in the BELLE project to extract fields from data generated by ChatGPT.

import re

# Each pattern captures the text after a start marker (lookbehind) up to the next marker (lookahead).
intruction_pattern = re.compile(r"(?<=(?:" + '|'.join(['指令:', '指令：']) + r"))[\s\S]*?(?=" + '|'.join(['输入:', '输入：']) + r")")
input_pattern = re.compile(r"(?<=(?:" + '|'.join(['输入:', '输入：']) + r"))[\s\S]*?(?=" + '|'.join(['输出:', '输出：']) + r")")
output_pattern = re.compile(r"(?<=(?:" + '|'.join(['输出:', '输出：']) + r"))[\s\S]*?(?=$)")
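
To see how the lookbehind and lookahead work together, the sketch below applies equivalent patterns to a made-up record. The sample text and the `extract` helper are assumptions for illustration; they are not taken from BELLE.

```python
import re

# Lookbehind anchors the match right after the start marker;
# the lazy [\s\S]*? then grows until the lookahead sees the next marker.
instruction_pattern = re.compile(r"(?<=(?:指令:|指令：))[\s\S]*?(?=输入:|输入：)")
input_pattern = re.compile(r"(?<=(?:输入:|输入：))[\s\S]*?(?=输出:|输出：)")
output_pattern = re.compile(r"(?<=(?:输出:|输出：))[\s\S]*?(?=$)")

# Hypothetical sample record in the "指令 / 输入 / 输出" layout.
sample = "指令：把下面的句子翻译成英文\n输入：今天天气很好\n输出：The weather is nice today."


def extract(pattern, text):
    """Return the first match with surrounding whitespace removed, or None."""
    match = pattern.search(text)
    return match.group(0).strip() if match else None


print(extract(instruction_pattern, sample))  # 把下面的句子翻译成英文
print(extract(input_pattern, sample))        # 今天天气很好
print(extract(output_pattern, sample))       # The weather is nice today.
```

Python's `re` module only accepts fixed-width lookbehind, which is why both spellings of each marker (ASCII colon and full-width colon) must have the same length.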

5. or_match

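import re
from typing import Iterable


def re_escape_word(word: str):
    if len(word) > 1:
        # Escape each character and allow optional whitespace between characters
        return r"\s{0,}".join(re.escape(ch) for ch in word)
    return re.escape(word)


def re_escape_words(words: Iterable[str]):
    # Join the escaped words into a single alternation
    return r"|".join([re_escape_word(word) for word in words])


def or_match(tokens: Iterable[str]):
    # Non-capturing group that matches any one of the tokens
    return r"(?:" + re_escape_words(tokens) + ")"


def or_match_with_name(tokens: Iterable[str]):
    # Named group "or_match_with_name" that matches any one of the tokens
    return r'(?P<or_match_with_name>(' + re_escape_words(tokens) + r'))'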
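
A short usage sketch of the idea: build an alternation from a list of tokens and read the hit back through the named group. The helpers are re-declared in compact form so the example runs on its own, with every character escaped before joining; the token list and test strings are made up for illustration.

```python
import re
from typing import Iterable


def re_escape_word(word: str) -> str:
    # Escape each character, then allow optional whitespace between characters.
    return r"\s*".join(re.escape(ch) for ch in word)


def or_match_with_name(tokens: Iterable[str]) -> str:
    # Named group that matches any one of the tokens.
    return "(?P<or_match_with_name>" + "|".join(re_escape_word(t) for t in tokens) + ")"


pattern = re.compile(or_match_with_name(["指令", "输入", "C++"]))

for text in ["请看指 令部分", "use C + + here", "nothing"]:
    match = pattern.search(text)
    print(text, "->", match.group("or_match_with_name") if match else None)
# 请看指 令部分 -> 指 令
# use C + + here -> C + +
# nothing -> None
```

Alternation in `re` is ordered, so when one token is a prefix of another, list the longer token first to keep the shorter one from winning.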

Resources

Regular expressions