I have a local directory on my machine that contains multiple HTML files, all named in the following format:
> XXXXXXXX_XXXX-XX-XX.html
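Each X is a digit (the number of digits before the underscore varies). I go through every file in the folder and, using a regular-expression match on the substring "segment", extract all strings found in two kinds of elements: "font" tags and "p" tags that carry a "style" attribute.
The output is a DataFrame holding the content of every extracted string, for example: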
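As a rough illustration of the naming scheme described above, the sketch below splits a file name into its leading digits and its date. The pattern assumes the date part is always YYYY-MM-DD, and the names FILENAME_RE and parse_filename are made up for this example.

```python
import re

# Assumed pattern: one or more digits, an underscore, then a YYYY-MM-DD date
FILENAME_RE = re.compile(r"^(\d+)_(\d{4}-\d{2}-\d{2})\.html$")


def parse_filename(name):
    """Return (digits, date) for a matching file name, or None otherwise."""
    m = FILENAME_RE.match(name)
    return (m.group(1), m.group(2)) if m else None


# Hypothetical file names, not taken from the real directory
print(parse_filename("12345678_2015-03-31.html"))
print(parse_filename("notes.txt"))
```

Returning None for non-matching names makes it easy to skip stray files while walking the directory.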
 Like our Prescription Pharmaceuticals segment, the manufacturing of our Consumer Health products is competitive, with many established manufacturers engaged in all phases of the business. With the Company’s relatively small OTC [...]
I need help changing the output as follows:
The code is below. Any help would be greatly appreciated, thanks.
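import os
import re

import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

rootdir = "C://directory//subdirectory"
pattern = re.compile(r'segment')
segments_font = []
segments_p_style = []

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = os.path.join(subdir, file)
        # Open in binary mode so BeautifulSoup can detect the encoding,
        # and name the parser explicitly
        with open(filepath, 'rb') as fh:
            soup = BeautifulSoup(fh, 'html.parser')
        # <font> elements whose text contains "segment"
        for elem in tqdm(soup.find_all('font', string=pattern)):
            segments_font.append(elem.get_text(strip=True))
        # 'p style' is not a tag name: match <p> elements that have a style attribute
        for elem in tqdm(soup.find_all('p', style=True, string=pattern)):
            segments_p_style.append(elem.get_text(strip=True))

# Merge both lists and drop duplicate strings (a set does not preserve order)
combined_list = list(set().union(segments_font, segments_p_style))
df = pd.DataFrame(data=combined_list, columns=['segments'])
df.to_csv('output.csv')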
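For anyone who wants to try the matching step without the files on disk, here is a minimal sketch that runs on an inline HTML string. The sample markup and the helper name extract_segment_text are invented for illustration, and it assumes the goal is the plain text of each matching element, with duplicates removed in first-seen order rather than through a set.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one of the HTML files
html = (
    '<p style="margin:0"><font>Our Consumer Health segment is competitive.</font></p>'
    '<font>Revenue grew in every region.</font>'
    '<p style="margin:0">The Prescription segment expanded.</p>'
    '<p>No style attribute here, but it mentions a segment.</p>'
)


def extract_segment_text(markup, pattern=re.compile(r"segment")):
    """Collect text from <font> and styled <p> elements that match pattern."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    # <font> elements whose string matches the pattern
    for tag in soup.find_all("font", string=pattern):
        texts.append(tag.get_text(strip=True))
    # <p> elements that have a style attribute and whose string matches
    for tag in soup.find_all("p", style=True, string=pattern):
        texts.append(tag.get_text(strip=True))
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(texts))


results = extract_segment_text(html)
for text in results:
    print(text)
```

A "p" that wraps a single "font" reports the same string as the "font" itself, so the same sentence can be collected twice; that is why the sketch deduplicates at the end.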