Read local HTML files and convert them to a DataFrame using Python

Posted 2024-04-20 12:26:51

Male | A programmer who enjoys writing Python code.

I have a local directory on my machine that contains multiple HTML files, all named in the following format:

> XXXXXXXX_XXXX-XX-XX.html

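Each X stands for a digit (the number of digits before the "_" varies). I visit every file in the folder and then, based on a regular-expression match for the substring "segment", extract all strings from two CSS style classes, "font" and "p style".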
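For reference, the naming scheme described above could be matched along these lines. This is only a minimal sketch: it assumes the part after the "_" is a YYYY-MM-DD date, and the name `split_filename` and the sample file names are hypothetical, not taken from the original files.

```python
import re

# Assumed pattern: one or more digits, "_", a YYYY-MM-DD date, then ".html".
FILENAME = re.compile(r"^(\d+)_(\d{4}-\d{2}-\d{2})\.html$")


def split_filename(name):
    """Return (numeric_prefix, date) for a matching name, or None otherwise."""
    match = FILENAME.match(name)
    return match.groups() if match else None


# Hypothetical file names, used only for illustration.
print(split_filename("12345678_2017-12-31.html"))
print(split_filename("notes.html"))
```

Using `\d+` rather than a fixed count keeps the sketch tolerant of the varying number of leading digits.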

The output is a DataFrame containing all of the extracted strings, for example:

Â Like our Prescription Pharmaceuticals segment, the manufacturing of our Consumer Health products is competitive, with many established manufacturers engaged in all phases of the business. With the Companyâ€™s relatively small OTC [...]

I need help changing the output as follows:

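I want to add another column to the DataFrame output that holds the digits before the "_" in the file name. That way I can match each string description to the HTML file it came from.
As the output snippet above shows, I am getting all sorts of Unicode errors, and I would like to get rid of them. In an earlier version of the code I tried using utf-8 encoding (on line 14, soup=...), but it did not work.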
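One possible reading of the stray characters in the sample output is that "Â" and "â€™" are what the UTF-8 bytes of a no-break space and a curly apostrophe look like when they are decoded as Windows-1252. The sketch below reproduces that effect on a hypothetical sample string. Whether this is what actually happens with these files is an assumption; the mismatch could occur when the HTML is read or when the CSV is opened in another program.

```python
# Hypothetical sample text: a no-break space (U+00A0) and a curly apostrophe (U+2019).
original = "\u00a0Like the Company\u2019s"

# Encode as UTF-8, then decode with the wrong codec (Windows-1252).
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

If the garbling follows this pattern, the usual remedy is to make the decoding step use the same encoding the bytes were written in, rather than to patch the strings afterwards.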

My code is below. Any help would be greatly appreciated, thanks.

import os
from bs4 import BeautifulSoup
from tqdm import tqdm
import re
import pandas as pd
import csv

rootdir = "C://directory//subdirectory"

segments_font=[]
segments_p_style=[]

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = subdir + os.sep + file
        soup = BeautifulSoup(open(filepath))
        for elem in tqdm(soup.find_all('font',text=re.compile(r'segment'))):
            segments_font.append(elem)
        for elem in tqdm(soup.find_all('p style',text=re.compile(r'segment'))):
            segments_p_style.append(elem)
    combined_list=list(set().union(segments_font,segments_p_style))

    df=pd.DataFrame(data=combined_list,columns=['segments'])
    df.to_csv('output.csv')


0 answers

No answers yet.
