How to extract a string using regular expressions in Python

1 vote

3 answers

5833 views

Asked 2025-04-16 14:35

I am trying to extract substrings from a string in Python.

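My data file contains the Quran, one verse per line, and each line begins with the chapter and verse numbers. I would like to extract the first number and the second number and write them to a line in another text file. Here are a few example lines from the text file.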

2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.

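As you can see, the chapter and verse numbers can contain more than one digit, so simply counting character positions from the start of the string is not enough. Is there a way to use regular expressions to extract the first number (chapter) and the second number (verse) as strings?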

My code tries to write the chapter and verse strings to an ARFF file. An example line from the ARFF file looks like this:

1,0,0,0,0,0,0,0,0,2,12

where the last two values are the chapter and verse.

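Here is the loop that writes the attributes I am interested in for each verse. After that, I would like to use a regular expression to extract the relevant substrings from each line and write the chapter and verse at the end.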

for line in verses:
    for item in topten:
        count = line.count(item)
        ARFF_FILE.write(str(count) + ",")
    # Here is where I could use regular expressions to extract the desired substrings,
    # chapter and verse, then write them to the end of the line in the ARFF file.
    ARFF_FILE.write("\n")

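I think the regular expression should look like this, with group(1) returning the chapter number (the first number, before the first pipe),

"^(\d+)\|(\d+)\|"

and group(2) returning the verse number.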

But I don't know how to implement this in Python. Does anyone have any ideas?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Response to a question.

I just tried implementing your approach, but I get "IndexError: list index out of range". My code is:

for line in verses:
 for item in topten:
     parts = line.split('|')

     count = line.count(item)
     ARFF_FILE.write(str(count) + ",")
 ARFF_FILE.write(parts[0] + ",")
 ARFF_FILE.write(parts[1])  
 ARFF_FILE.write("\n")

Tags: regular expressions, data processing, data analysis, text files, IndexError, string extraction, ARFF files, number extraction

3 Answers

-1

With parentheses? Isn't that how all regular expressions work?
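
As a minimal sketch of what capturing with parentheses might look like for this data, assuming every line follows the "chapter|verse|text" layout shown in the question and using Python 3: the names `LINE_PATTERN` and `extract_chapter_and_verse` are hypothetical and chosen only for illustration. Note that `\d+` is needed in both groups so that multi-digit numbers such as 242 are captured in full, and that `group(0)` is the entire match, while the parenthesised groups are numbered from 1.

```python
import re

# Pattern assumed from the question's "chapter|verse|text" line format.
LINE_PATTERN = re.compile(r"^(\d+)\|(\d+)\|")


def extract_chapter_and_verse(line):
    """Return (chapter, verse) as strings, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    # group(0) would be the whole match, e.g. "2|242|";
    # the captured groups start at index 1.
    return match.group(1), match.group(2)


sample = "2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand."
print(extract_chapter_and_verse(sample))
```

Inside the question's loop, the two returned strings could then be written after the counts, separated by a comma.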

Answered 2025-04-16 by Python大师

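I think the simplest way is to use re.split() to get the text of the verses and re.findall() to get the chapter and verse numbers. The results are stored in lists that can be used later. Here is a code example: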

#!/usr/bin/env python3

import re

# String to be parsed
Quran = '''2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.'''

# List containing the text of all the verses; re.split() yields an empty
# string before the first match, so it is removed
verses = re.split(r'[0-9]+\|[0-9]+\|', Quran)
verses.remove("")

# List containing the chapter and verse numbers:
#
#   Strictly, the pattern would be r'[0-9]+\|[0-9]+\|'. The last pipe
#   character is omitted so that splitting each match on "|" later does
#   not leave an empty string at the end of the list.
#
chapter_verse = re.findall(r'[0-9]+\|[0-9]+', Quran)


# Loop over the verse texts, assuming len(verses) == len(chapter_verse)
for index in range(len(verses)):
    chapterNumber, verseNumber = chapter_verse[index].split("|")
    print("Chapter :", chapterNumber, "\tVerse :", verseNumber)
    # strip() drops the newline that re.split() leaves at the end of each verse
    print(verses[index].strip())
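
The two lists could also be built in a single pass. The following is only a sketch of that alternative in Python 3, not part of the approach above: it assumes each verse sits on its own line in "chapter|verse|text" form, and the names `VERSE_PATTERN`, `text`, and `records` are hypothetical.

```python
import re

text = '''2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.'''

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so each line produces one match with three captured groups.
VERSE_PATTERN = re.compile(r"^(\d+)\|(\d+)\|(.*)$", re.MULTILINE)

# Each record is a (chapter, verse, verse_text) tuple of strings.
records = [match.groups() for match in VERSE_PATTERN.finditer(text)]

for chapter, verse, verse_text in records:
    print("Chapter:", chapter, "\tVerse:", verse)
    print(verse_text)
```

Keeping the numbers and the text in one tuple avoids relying on two separate lists having the same length.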

Answered 2025-04-16 by Python大师

If all of your lines have the form A|B|C, you don't need a complicated regular expression; just split the line.

for line in fp:
    parts = line.split('|') # or line.split('|', 2) if the last part can contain |
    # use parts[0], parts[1]
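
A fuller sketch of the split approach, in Python 3, might look like the following. It rests on an assumption: that the "list index out of range" error reported in the question comes from blank or malformed lines, for which `split('|')` returns a single-element list. The helper names `chapter_and_verse` and `write_rows`, and the sample `lines` and `terms`, are hypothetical stand-ins for the question's `verses`, `topten`, and `ARFF_FILE`.

```python
import io


def chapter_and_verse(line):
    """Split a 'chapter|verse|text' line; return None if it lacks two pipes."""
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) < 3:
        return None  # blank or malformed line: parts[1] would not exist
    return parts[0], parts[1]


def write_rows(lines, terms, out):
    """Write one comma-separated row per well-formed line to `out`."""
    for line in lines:
        numbers = chapter_and_verse(line)
        if numbers is None:
            continue  # skip lines that do not follow the expected format
        counts = [str(line.count(term)) for term in terms]
        out.write(",".join(counts + list(numbers)) + "\n")


# Hypothetical sample input: one well-formed line and one blank line.
lines = ["2|12|Of a surety, they are the ones who make mischief.\n", "\n"]
terms = ["they", "Allah"]
buffer = io.StringIO()
write_rows(lines, terms, buffer)
print(buffer.getvalue())
```

Splitting once per line, outside the loop over the terms, also avoids repeating the same work for every term.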

Answered 2025-04-16 by Python大师
