Split a Unicode string only on universal newlines (\n, \r, \r\n)

my_text = 'Line 1\f\rLine 2\r\nLine 3\f...\nLine 4\n'

# Desired output:
lines = split_only_universal_newlines(my_text)
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']
# Note that the form feed character \f is printed as '\x0c'.

# Incorrect output produced by str.splitlines:
lines = my_text.splitlines(keepends=True)
print(lines)
# ['Line 1\x0c', '\r', 'Line 2\r\n', 'Line 3\x0c', '...\n', 'Line 4\n']

2 answers

User

Answer 1 · edited 2024-04-25 12:52:41

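Besides regular expressions, I can think of two approaches. The first is to use bytes.splitlines, which according to the docs splits only on universal newlines.

A solution based on this idea looks like this: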

lines = [l.decode() for l in my_text.encode().splitlines(keepends=True)]
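As a minimal sketch, the one-liner above can be wrapped in a helper with the encoding spelled out. The name split_via_bytes is made up for this illustration. It assumes UTF-8, where the bytes 0x0A and 0x0D never occur inside a multi-byte sequence, so splitting the encoded bytes cannot cut a character in half; that would not hold for an encoding such as UTF-16.

```python
def split_via_bytes(text):
    """Split text on \\n, \\r and \\r\\n only, keeping the line endings.

    Hypothetical helper name; the technique is the bytes.splitlines
    approach from the answer above, with UTF-8 made explicit.
    """
    encoded = text.encode("utf-8")
    # bytes.splitlines only recognises the universal newlines,
    # unlike str.splitlines, which also splits on \f, \v, \u2028, etc.
    return [chunk.decode("utf-8") for chunk in encoded.splitlines(keepends=True)]


my_text = 'Line 1\f\rLine 2\r\nLine 3\f...\nLine 4\n'
lines = split_via_bytes(my_text)
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']
```

Because the line endings are kept, joining the pieces gives back the original string.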

The other approach is to use the text I/O classes:

import io

lines = list(io.StringIO(my_text, newline=''))

Here, according to the io.StringIO docs, the newline keyword works as follows:

The newline argument works like that of TextIOWrapper.

And from the io.TextIOWrapper docs:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
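To make the quoted rules concrete, here is a small sketch that reads the same string through io.StringIO with three different newline values. The sample string and variable names are made up for illustration. Note that io.StringIO defaults to newline='\n' rather than None, which is why the argument has to be passed explicitly.

```python
import io

sample = 'a\rb\r\nc\n'

# newline=None: universal newlines, every ending translated to '\n'.
translated = list(io.StringIO(sample, newline=None))

# newline='': universal newlines, endings returned untranslated.
untranslated = list(io.StringIO(sample, newline=''))

# newline='\n' (the io.StringIO default): only '\n' ends a line.
lf_only = list(io.StringIO(sample, newline='\n'))

print(translated)    # ['a\n', 'b\n', 'c\n']
print(untranslated)  # ['a\r', 'b\r\n', 'c\n']
print(lf_only)       # ['a\rb\r\n', 'c\n']
```

Only newline='' both splits on all three universal newlines and leaves the text unchanged, which is what the question asks for.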

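The latter approach looks better because it does not need to create another copy of the input string (as my_text.encode() does). Also, if you want to iterate over each line of the input, you can simply write: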

for line in io.StringIO(my_text, newline=''):
    ...

User

Answer 2 · edited 2024-04-25 12:52:41

Use io.StringIO(my_text, newline='').readlines(). newline='' means that (only) universal newlines are treated as line separators, and that line endings are returned to the caller untranslated.

import io
lines = io.StringIO(my_text, newline='').readlines()
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']
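For completeness, a sketch of the split_only_universal_newlines function named in the question, built on the io.StringIO call above. The function body is one possible implementation, not something given in the question, and the second test string is invented here to contrast it with str.splitlines on the extra separators that str.splitlines recognises.

```python
import io


def split_only_universal_newlines(text):
    """Split on \\n, \\r and \\r\\n only, keeping each line ending."""
    # newline='' enables universal newlines without translating them.
    return io.StringIO(text, newline='').readlines()


my_text = 'Line 1\f\rLine 2\r\nLine 3\f...\nLine 4\n'
lines = split_only_universal_newlines(my_text)
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']

# Illustrative input with separators that only str.splitlines honours:
# vertical tab, file separator, LINE SEPARATOR, and NEL.
extra = 'a\x0bb\x1cc\u2028d\x85e\n'
kept_whole = split_only_universal_newlines(extra)
over_split = extra.splitlines(keepends=True)
print(len(kept_whole), len(over_split))
# 1 5
```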

