BeautifulSoup: XML to DataFrame

Published 2024-06-10 09:08:59


I am a machine learning beginner exploring datasets for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I am trying to create a pandas DataFrame by parsing the XML data, and I also want to add a label (1) to the positive reviews. Could someone help me write the code? A sample of the parsed output is shown below.

from bs4 import BeautifulSoup
# passing a parser explicitly avoids bs4's "no parser specified" warning
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read(), 'lxml')
positive_reviews = positive_reviews.find_all('review_text')  # find_all is the modern name for findAll
positive_reviews[0]



<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in &lt;2 business days
</review_text>
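Roughly what I have in mind, as a sketch assuming the positive_reviews result above (the column name review_text is my own choice), is a frame like this:

import pandas as pd

# rough sketch of the target frame: one row per review, label 1 for positive
positive_df = pd.DataFrame({
    "review_text": [r.get_text(strip=True) for r in positive_reviews],
    "label": 1,  # 1 marks a positive review
})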

1 Answer

Posted by a user on 2024-06-10 09:08:59
  • The main issue is to notice that the data is pseudo-XML
  • Download the tar.gz file and unpack/untar it
  • Build a dictionary of all the files
  • Workaround for the pseudo-XML: insert a root element into the string representation of each document
  • Then use a list/dict comprehension to generate rows in the simple format the pandas constructor expects
  • dfs is a dictionary of DataFrames, ready to use
import requests
from pathlib import Path
from tarfile import TarFile
from bs4 import BeautifulSoup
import io
import pandas as pd

# download tar with pseudo-XML...
url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
fn = Path.cwd().joinpath(url.split("/")[-1])
if not fn.exists():
    r = requests.get(url, stream=True)
    with open(fn, 'wb') as f:
        for chunk in r.raw.stream(1024, decode_content=False):
            if chunk:
                f.write(chunk)

# untar downloaded file and generate a dictionary of all files
TarFile.open(fn, "r:gz").extractall()
files = {f"{p.parent.name}/{p.name}":p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}

# convert all files into dataframes in a dict
dfs = {}
for file in files.keys():
    with open(files[file]) as f: text = f.read()
    # pseudo-XML: there is no root element, which stops it from being well formed,
    # so force one in...
    soup = BeautifulSoup(f"<root>{text}</root>", "xml")
    # simple case of each review is a row and each child element is a column
    dfs[file] = pd.DataFrame([{c.name:c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
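The loop above builds one DataFrame per file but does not yet add the label the question asks for. A minimal sketch of that last step, assuming the dfs dictionary from above; labelling negative reviews 0 is my own assumption, since the question only specifies 1 for positives:

# a minimal sketch: tag each per-file frame and stack them into one DataFrame;
# the 0 for negative reviews is an assumption - the question only asks for 1
labelled = []
for name, df in dfs.items():
    if name.endswith("positive.review"):
        labelled.append(df.assign(label=1))
    elif name.endswith("negative.review"):
        labelled.append(df.assign(label=0))  # unlabeled.review files are skipped

reviews = pd.concat(labelled, ignore_index=True)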
