CSV file containing a column with occasional commas in parentheses crashes pandas.read_csv

3 Answers

User

Answer #1 · edited 2024-04-18 02:28:19

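As mentioned in the comments, you could write a grammar for the "broken" CSV lines and feed the parsed result into a pandas DataFrame.
The following can certainly be optimized, but it might give you an idea: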

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd

broken_garbage = """
1, (2, 3), 4
colAVal, (colBVal_1, colBVal_2), colCVal,
this, one, right
234,(123,456),789
"""

grammar = Grammar(
    r"""
    content     = garbage? line+
    line        = entry+ newline?
    entry       = value sep?
    value       = word / (lpar word sep word rpar)

    lpar        = "("
    rpar        = ")"
    word        = ~"\w+"
    sep         = ws? "," ws?

    ws          = ~"[\t ]+"
    newline     = ~"[\r\n]+"
    garbage     = (ws / newline)+
    """
)

class BrokenVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_value(self, node, visited_children):
        child = visited_children[0]
        if isinstance(child, list):
            _, value1, _, value2, _ = child
            return (value1.text, value2.text)
        else:
            return child.text

    def visit_entry(self, node, visited_children):
        values, _ = visited_children
        return values

    def visit_line(self, node, visited_children):
        content = visited_children[0]
        return [item for item in content]

    def visit_content(self, node, visited_children):
        return visited_children[1]

tree = grammar.parse(broken_garbage)

broken = BrokenVisitor()
values = broken.visit(tree)

df = pd.DataFrame(values, columns=["one", "two", "three"])
print(df)

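This yields a DataFrame with one row per input line; each parenthesized pair ends up as a single tuple value in its cell.

Take a look at the grammar, which mirrors your structure. The BrokenVisitor class visits every grammar rule and returns the rows as a list. That result is then fed to the pandas.DataFrame constructor.

Alternatively, you could use the newer regex module, which supports \K, and replace the commas inside the parentheses with another character (the pattern below matches the first comma after an opening parenthesis):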

\([^,()]+\K,

In Python, this could be:

import regex as re

rx = re.compile(r'\([^,()]+\K,')
new_string = rx.sub('@', old_string)

You can then feed the new string directly into pandas.read_csv().
See a demo on regex101.com.
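If the third-party regex module is not available, a similar replacement could be sketched with the standard re module by rewriting each parenthesized group in a callback. This is a different technique from the \K pattern above, and it assumes the parentheses are never nested. The helper name mask_commas_in_parens is made up for illustration; the "@" placeholder and the sample lines come from the answer above.

```python
import csv
import io
import re

def mask_commas_in_parens(text, placeholder="@"):
    """Replace every comma inside a (non-nested) parenthesized group."""
    return re.sub(
        r"\([^()]*\)",                                   # one "( ... )" group
        lambda m: m.group(0).replace(",", placeholder),  # rewrite only that group
        text,
    )

sample = "1, (2, 3), 4\n234,(123,456),789\n"
cleaned = mask_commas_in_parens(sample)

# Parse with the stdlib csv module just to confirm that each line
# now splits into exactly three fields.
rows = list(csv.reader(io.StringIO(cleaned), skipinitialspace=True))
print(rows)
```

The cleaned string can equally be wrapped in io.StringIO and handed to pandas.read_csv(); the placeholder would then have to be turned back into a comma afterwards if the original values matter.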

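User

Answer #2 · edited 2024-04-18 02:28:19

Thanks for the suggestions; I went with a plain search and replace, and it works well. I am adding the code below in case anyone else runs into this kind of problem.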

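from io import StringIO  # Python 3; the original used the Python 2 StringIO module
import pandas as pd

# Read the whole file and replace the known problematic value before parsing
with open('file/location', "r") as fh:
    text = StringIO(fh.read()
                    .replace("(colBVal_1, colBVal_2)", "(colBVal_1 colBVal_2)"))
df = pd.read_csv(text)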

User

Answer #3 · edited 2024-04-18 02:28:19

Without seeing any sample data it is hard to know exactly what is needed, but:

import re
import pandas as pd

def my_parser(csv_file):
    with open(csv_file, "r") as fh:
        for line in fh:
            line = line.strip()

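            if re.match(r".*\(.*,.*\).*", line):
                # Process line with extra commas
                # ...
                processed_line = line  # placeholder so the sketch runs; real handling goes here
            else:
                # Process normal line
                # ...
                processed_line = line  # placeholder; real handling goes here

            yield processed_line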


df = pd.DataFrame(my_parser("file.csv"), ...)

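For the processing step, you could try replacing only the commas inside parentheses with another character.

I would suggest a record type with named fields (a namedtuple, for instance) as the structure to hold your processed_line, because its fields are automatically used by pandas as the series names; you will still have to do some type checking or specification, since pandas will otherwise treat all entries as strings.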
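As a rough sketch of that idea, the generator below splits each line only on commas that sit outside parentheses and yields records with named fields and basic type conversion. Everything here is an assumption made for illustration: the Row type, its column names one/two/three, the helper names, and the rule that purely numeric text becomes an int.

```python
from collections import namedtuple

# Hypothetical column names; the real file's header would go here.
Row = namedtuple("Row", ["one", "two", "three"])

def split_outside_parens(line, sep=","):
    """Split on `sep`, ignoring separators nested inside parentheses."""
    fields, current, depth = [], [], 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields

def to_number(value):
    """Try to convert to int; fall back to the original string."""
    try:
        return int(value)
    except ValueError:
        return value

def parse_lines(lines):
    """Yield one Row per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = [to_number(f) for f in split_outside_parens(line)]
        # Keep the first three fields; a trailing comma produces an extra empty one.
        yield Row(*fields[:3])

rows = list(parse_lines(["1, (2, 3), 4", "this, one, right", "234,(123,456),789"]))
print(rows)
```

A list of such Row records can be passed to pandas.DataFrame, which picks up the namedtuple field names as column names.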
