Records lost at the line data_in.read().replace("<*", "<").replace("*\n", "")

Posted 2024-03-29 15:29:19


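After running the code below, I have been trying to work out why 47 of the 700+ records in my database go missing. Please help me check whether this is a coding error in my Python or a memory limit.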

def create_csv_file():
    source_html = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Raw).txt', 'r')
    bs_object = BeautifulSoup(source_html, "html.parser")

    data_out = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv', 'w+')
    data_in = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv', 'r')
    csv_file1 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'w+')
    csv_file2 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'r')
    csv_file3 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'w+')

    writer1 = csv.writer(data_out, delimiter='<', skipinitialspace=True)

    table = bs_object.find("table", {"id":"gasOfferSearch"})
    rows = table.findAll("tr")

    try:
        # Iterates through the list, but skips the first record (i.e. the table header)
        for row in rows[1:]:
            csvRow = []
            for cell in row.findAll(['td','th']):
                # Replace "\n" with a whitespace; replace <br> tags with 5 whitespaces
                line = str(cell).replace('\n', ' ').replace('<br>', '     ')
                # Replace runs of 2 or more whitespace characters with "*"
                line = re.sub('\s{2,}', '*', line)
                # Converts results to a BeautifulSoup object
                line_bsObj = BeautifulSoup(line, "html.parser")
                # Strips: Removes all tags and trailing and leading whitespaces
                # Replace: Removes all quotation marks
                csvRow.append(line_bsObj.get_text().strip().replace('"',''))

            # Converts the string into a csv file
            writer1.writerow(csvRow)

        # Reads from the temp file and replaces all "<*" with "<"
        # TODO: Issue - 47 records missing with replacement
        temp_string = data_in.read().replace("<*", "<").replace("*\n", "")
        csv_file1.write(temp_string)

        # Clear the temp_string variable
        temp_string = ""
        for line in csv_file2.readlines():
            temp_string += line.replace("*", "<", 1)

        csv_file3.write(temp_string)

    finally:
        source_html.close()
        csv_file1.close()
        csv_file2.close()
        data_out.close()
        data_in.close()

        # Remove the temp file
        # os.remove('C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv')

    return None

1 answer
User
#1 · Posted 2024-03-29 15:29:19

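I don't know exactly what is going wrong, but here is some general advice:

  • Don't open the same file three times at once (csv_file1, csv_file2 and csv_file3 all point to the same file)
  • Add print statements to check carefully what is happening:
    • Put one before for row in rows[1:] that prints the total number of rows
    • Put some around temp_string = data_in... to confirm that the counts there are what you expect
  • If none of that reveals the problem, post a few sample records so we can take a look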
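
The advice above can be sketched as follows. This is a minimal illustration, not the asker's full script. The function name count_through_pipeline and the temp_path parameter are hypothetical, and lineterminator="\n" is an assumption made to keep the sample simple. The file is opened once for writing and closed before it is opened for reading, so no rows can be left in an unflushed write buffer, and the record count is printed at each stage.

```python
import csv


def count_through_pipeline(rows, temp_path):
    """Write rows to temp_path, read them back, and count lines at each stage.

    Hypothetical helper for illustration; returns
    (rows_written, lines_read_back, lines_after_replace).
    """
    print("rows to write:", len(rows))

    # Open the temp file once for writing. Leaving the with block flushes
    # and closes it, so nothing stays behind in the write buffer.
    with open(temp_path, "w", newline="") as data_out:
        # lineterminator="\n" is an assumption to keep the sample simple.
        writer = csv.writer(data_out, delimiter="<", lineterminator="\n")
        for row in rows:
            writer.writerow(row)

    # Reopen for reading only after the writer has been closed.
    with open(temp_path, "r") as data_in:
        raw = data_in.read()
    lines_read = raw.count("\n")
    print("lines read back:", lines_read)

    # Same replacements as in the question. Note that "*\n" -> "" also
    # deletes the newline, so a row ending in "*" is joined to the next row.
    cleaned = raw.replace("<*", "<").replace("*\n", "")
    lines_after = cleaned.count("\n")
    print("lines after replace:", lines_after)

    return len(rows), lines_read, lines_after
```

If the first two numbers differ, rows were probably still buffered when the file was read. If the last number is smaller than the second, the replacement step is merging lines. Either result would narrow down where the 47 records disappear, though without sample records this remains a guess.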
