Regular expressions with quotes in Python

Published 2024-04-24 12:35:21


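I am trying to create a regular expression pattern for strings like the ones below, which are stored in a file. The goal is to get any column of any row, where a row does not have to sit on a single line. For example, consider the following file: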

"column1a","column2a","column
  3a,",             #entity 1
"column\"this is, a test\"4a"
"column1b","colu
     mn2b,","column3b",             #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",             #entity 3
"column\"this is, a test\"4c"

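Each entity consists of four columns. Column 4 of entity 2 would be "column\"this is, a test\"4b", and column 2 of entity 3 would be "column2c". Every column begins with a quote and ends with a quote, but you have to be careful because some columns contain escaped quotes. Thanks in advance!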


2 answers

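You can do it like this:

  1. Read the whole file.

  2. Split the input on every newline that is not preceded by a comma.

  3. Iterate over the resulting pieces and split each one again on every comma (plus an optional newline after it) that has a double quote both before and after it.

Code: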

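import re

path = "f"  # name of the input file
with open(path) as f:
    fil = f.read()

# A newline that does not follow a comma ends a record
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    # Split on a comma (and optional newline) between two double quotes
    print(re.split(r'(?<="),\n?(?=")', i))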

Output:

['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']

Here is the check:

$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']

f is the name of the input file and f.py is the name of the file containing the Python script.
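
Note that the file used in the check above has no column wrapped across lines and no #entity notes, unlike the sample in the question. For wrapped input, one option is to skip line handling entirely: match every quoted column across the whole text and group the matches in fours. This is only a minimal sketch, assuming that every entity has exactly four columns, that nothing outside the quotes contains a double quote, and that a line break inside a column (plus the indentation after it) should be dropped. The names parse_entities and column_re are made up for the example.

```python
import re

# Sample input copied from the question; the "#entity N" notes sit outside the quotes.
text = r'''"column1a","column2a","column
  3a,",             #entity 1
"column\"this is, a test\"4a"
"column1b","colu
     mn2b,","column3b",             #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",             #entity 3
"column\"this is, a test\"4c"'''

COLUMNS = 4  # assumption: every entity has exactly four columns

# A quoted column: an opening quote, then any mix of ordinary characters
# (newlines included) and backslash-escaped characters, then a closing quote.
# The group captures the text between the quotes.
column_re = re.compile(r'"((?:[^"\\]|\\.)*)"', re.DOTALL)


def parse_entities(data, columns=COLUMNS):
    """Return a list of entities, each a list of `columns` column values."""
    values = column_re.findall(data)
    # Assumption: a line break inside a column, together with the indentation
    # that follows it, is a wrapping artifact and can be removed.
    values = [re.sub(r'\n\s*', '', v) for v in values]
    # Group the flat list of values into chunks of `columns`.
    return [values[i:i + columns] for i in range(0, len(values), columns)]


entities = parse_entities(text)
print(entities[1][3])  # entity 2, column 4
print(entities[2][1])  # entity 3, column 2
```

Escaped quotes are left in the values as they appear in the file (backslash included). If the column count can vary, grouping in fixed chunks will not work and the line-based approach is the safer choice.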

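Your question is very familiar: it is something I have to deal with about three times a month :) Except that I don't use Python to solve it, but I can "translate" what I usually do: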

text = r'''"column1a","column2a","column
  3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
     mn2b,","column3b",             
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''

import re

# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')

# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print, if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""
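
The check pattern is fairly dense, so here it is once more, written with re.VERBOSE and a comment on each part. This is only an illustrative sketch: it uses a slightly simplified form of the pattern (the inner * is dropped, which should match the same strings), and the names column, complete and partial are picked for the example, with the sample strings taken from the data above.

```python
import re

# The column pattern spelled out piece by piece.
column = re.compile(r'''
    "             # opening quote
    (?:
        [^"\\]    # one character that is neither a quote nor a backslash
      |
        \\.       # a backslash plus the character it escapes
    )*
    "             # closing quote
''', re.VERBOSE)

# A complete line yields all four columns, escaped quotes included.
complete = r'"column1c,","column2c","column3c","column\"this is, a test\"4c"'
# A partial line yields fewer, because the last quote is never closed.
partial = r'"column1b","colu'

print(len(column.findall(complete)))
print(column.findall(partial))
```

That difference in the match count is what the loop relies on to decide whether a line is complete or has to wait in buffer for the next one.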

