Python not splitting on CRLF correctly

0 votes

3 answers

4494 views

Asked 2025-04-16 12:08

I'm writing a Python script that converts very simple function documentation into XML. The format I'm using turns:

date_time_of(date) Returns the time part of the indicated date-time value, setting the date part to 0.

into:

<item name="date_time_of">

<arg>(date)</arg>

<help> Returns the time part of the indicated date-time value, setting the date part to 0.</help>

</item>

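So far the program works well (it generated the XML I pasted above). The problem is that it is supposed to handle several lines of documentation pasted at once, but it only processes the first line pasted into the application. I checked the pasted documentation in Notepad++, and every line does end with CRLF, so where is my mistake? Here is my code: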

mainText = input("Enter your text to convert:\r\n")

try:
    for line in mainText.split('\r\n'):
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")

What do you think the problem could be? Thanks.

Tags: scripting, newlines, multiline text, xml, document processing, Notepad++

3 Answers

Patrick Moriarty,

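It seems to me that you didn't specifically mean the console, and that your main concern is processing several lines of data in one go. The only way I could reproduce your problem was to run the program in IDLE, manually copy several lines from a file, and paste them into raw_input().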

Trying to understand your problem led me to the following findings:

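When data is copied from a file and pasted into raw_input(), the line endings \r\n become \n, so the string returned by raw_input() no longer contains any \r\n. As a result, split('\r\n') finds nothing to split on in that string.
When data containing isolated \r and \n characters is pasted into a Notepad++ window with the display of special characters turned on, CR LF symbols appear at the ends of the lines, even where there is only a lone \r or a lone \n. So using Notepad++ to check the nature of the line endings leads to wrong conclusions.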
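
The first finding can be illustrated with a small sketch. It assumes the pasted text has already reached the program with bare \n endings; the string `pasted` is made up for illustration. It shows why split('\r\n') leaves everything in one piece, and that repr() reveals the characters that are really there:

```python
# Hypothetical pasted text whose CRLF endings were already reduced to LF.
pasted = ("date_time_of(date) Returns the time part.\n"
          "divmod(a, b) Returns quotient and remainder.")

# '\r\n' never occurs in the string, so split() returns a one-element list.
pieces_crlf = pasted.split('\r\n')
# Splitting on '\n' finds the separator that is actually present.
pieces_lf = pasted.split('\n')

print(repr(pasted))       # repr() shows the escape sequences really in the string
print(len(pieces_crlf))   # 1
print(len(pieces_lf))     # 2
```

Printing repr() of the received string from inside the program is a more direct check than looking at the pasted text in an editor.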

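The first fact is the root of your problem. I don't know why data copied from a file undergoes this conversion, so I posted a question about it on stackoverflow:

Strange disappearance of CR when a file's content is copied into raw_input()

The second fact is what caused your confusion and despair. Nothing to be done about that...

So, how can your problem be solved?

Here is some code that reproduces the problem. Note the modified algorithm in it, which replaces the split calls you repeated on every line.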

ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"

with open('funcdoc.txt','wb') as f:
    f.write(ch)

print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)

print "open 'funcdoc.txt' to manually copy its content, and paste it on the following line"
mainText = raw_input("Enter your text to convert:\n")
print "OK, copy-paste of file 'funcdoc.txt' ' s content has been performed"


print "\nrepr(mainText)==",repr(mainText)

try:
    for line in mainText.split('\r\n'):  
        name,_,arghelp  = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")
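
The partition-based parsing used above can be looked at in isolation. This is only a sketch: `parse_doc_line` is a hypothetical helper name, not something from the question, and it is written so that it runs unchanged on Python 2 and Python 3.

```python
def parse_doc_line(line):
    """Split 'name(args) help text' into (name, args, help).

    Hypothetical helper for illustration; it mirrors the two
    partition() calls used in the loop above.
    """
    name, _, rest = line.partition("(")   # text before the first "("
    arg, _, hlp = rest.partition(") ")    # arguments up to the first ") ", help after it
    return name, arg, hlp


print(parse_doc_line("divmod(a, b) Returns quotient and remainder."))
# ('divmod', 'a, b', 'Returns quotient and remainder.')
print(parse_doc_line("A"))
# ('A', '', '')
```

Unlike indexing the result of split(), partition() always returns three strings, so a line without parentheses yields empty fields rather than raising an exception. That also means the except branch is not reached for such lines.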

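This is the solution mentioned by delnan: "read directly from the source instead of having a human copy and paste it." That way your split('\r\n') works as intended: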

ch = "date_time_of(date) Returns the time part.\r\n"+\
     "divmod(a, b) Returns quotient and remainder.\r\n"+\
     "enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
     "A\rB\nC"

with open('funcdoc.txt','wb') as f:
    f.write(ch)

print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)

#####################################

with open('funcdoc.txt','rb') as f:
    mainText = f.read()

print "\nfile 'funcdoc.txt' has just been opened and its content copied and put to mainText"

print "\nrepr(mainText)==",repr(mainText)
print

try:
    for line in mainText.split('\r\n'):  
        name,_,arghelp  = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")

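Finally, Python offers a way to cope with this human-copying problem: the string method splitlines(), which treats every kind of line ending (\r, \n or \r\n) as a separator. So replace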

for line in mainText.split('\r\n'):

with

for line in mainText.splitlines():
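
A short sketch of how splitlines() behaves on mixed line endings; the sample strings are made up for illustration:

```python
# One string mixing all three line-ending styles.
text = "first\r\nsecond\rthird\nfourth"
lines = text.splitlines()
print(lines)   # ['first', 'second', 'third', 'fourth']

# A trailing line ending does not produce an empty last element,
# whereas split('\r\n') does.
trailing = "a\r\nb\r\n"
with_splitlines = trailing.splitlines()
with_split = trailing.split('\r\n')
print(with_splitlines)   # ['a', 'b']
print(with_split)        # ['a', 'b', '']
```

Because the line-ending style no longer matters, the same loop works whether the text was read from a file in binary mode or pasted by hand.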

Answered 2025-04-16 by Python大师


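The best way to read lines from standard input (that is, the console) is to iterate over the sys.stdin object. With that change, your code would look something like this: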

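from sys import stdin
try:
  for line in stdin:
    # each line read from stdin keeps its line ending; strip it before parsing
    line = line.rstrip("\r\n")
    name = line.split("(")[0]
    arg = line.split("(")[1]
    arg = arg.split(")")[0]
    hlp = line.split(")",1)[1]
    print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except IndexError:
    # raised when a line has no "(" or no ")"
    print("Error!")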

It is also worth mentioning that a regular expression can simplify your parsing code considerably. Here is an example:

import re, sys

for line in sys.stdin:
  # group 1: name, group 2: text inside the first (...), group 3: help text
  result = re.match(r"(.*?)\((.*?)\)(.*)", line)
  if result:
    name = result.group(1)
    arg  = result.group(2)  # kept as a string so %s prints it as written, not as a list
    hlp  = result.group(3)
    print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
  else:
    print("There was an error parsing this line: '%s'" % line.rstrip("\r\n"))

Hope this helps you simplify your code.

Answered 2025-04-16 by Python大师


The input() function reads only one line.

You can try this instead. Enter an empty line to stop collecting input.

lines = []
while True:
    line = input('line: ')
    if line:
        lines.append(line)
    else:
        break
print(lines)
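
As a sketch of how this loop could be connected to the conversion in the question: `collect_lines` and `to_item` are hypothetical helper names, and the reader function is passed in as a parameter only so that the example can run without a keyboard (it defaults to input()).

```python
def collect_lines(read_line=input):
    """Collect lines until an empty one is entered.

    read_line is a hypothetical hook: any zero-argument callable that
    returns the next line as a string. It defaults to input().
    """
    lines = []
    while True:
        line = read_line()
        if not line:
            break
        lines.append(line)
    return lines


def to_item(line):
    """Format one 'name(args) help' line as the XML item from the question."""
    name, _, rest = line.partition("(")
    arg, _, hlp = rest.partition(")")
    return '<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>' % (name, arg, hlp)


# Simulated user input; the final empty string stands for the blank line.
fake_input = iter([
    "date_time_of(date) Returns the time part.",
    "divmod(a, b) Returns quotient and remainder.",
    "",
])
items = [to_item(line) for line in collect_lines(lambda: next(fake_input))]
print("\n".join(items))
```

In real use, calling collect_lines() with no argument reads from the console, one input() call per pasted line.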

Answered 2025-04-16 by Python大师
