BeautifulSoup: XML to DataFrame

Published 2024-06-10 09:08:59


I am a machine learning beginner exploring datasets for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I am trying to create a pandas DataFrame by parsing the XML data, and I also want to add a label (1) to the positive reviews. Could someone help me write the code? A sample of the parsed output is shown below.

from bs4 import BeautifulSoup
# passing a parser explicitly avoids bs4's "no parser specified" warning
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read(), 'lxml')
positive_reviews = positive_reviews.find_all('review_text')  # find_all is the modern name for findAll
positive_reviews[0]



<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in &lt;2 business days
</review_text>
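Roughly what I have in mind, as a sketch assuming the positive_reviews result above (the column name review_text is my own choice), is a frame like this:

import pandas as pd

# rough sketch of the target frame: one row per review, label 1 for positive
positive_df = pd.DataFrame({
    "review_text": [r.get_text(strip=True) for r in positive_reviews],
    "label": 1,  # 1 marks a positive review
})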

1 Answer

Posted by a user on 2024-06-10 09:08:59
  • The main issue is to notice that the data is pseudo-XML
  • Download the tar.gz file and unpack/untar it
  • Build a dictionary of all the files
  • Workaround for the pseudo-XML: insert a root element into the string representation of each document
  • Then use a list/dict comprehension to generate rows in the simple format the pandas constructor expects
  • dfs is a dictionary of DataFrames, ready to use
import requests
from pathlib import Path
from tarfile import TarFile
from bs4 import BeautifulSoup
import io
import pandas as pd

# download tar with pseudo-XML...
url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
fn = Path.cwd().joinpath(url.split("/")[-1])
if not fn.exists():
    r = requests.get(url, stream=True)
    with open(fn, 'wb') as f:
        for chunk in r.raw.stream(1024, decode_content=False):
            if chunk:
                f.write(chunk)

# untar downloaded file and generate a dictionary of all files
TarFile.open(fn, "r:gz").extractall()
files = {f"{p.parent.name}/{p.name}":p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}

# convert all files into dataframes in a dict
dfs = {}
for file in files.keys():
    with open(files[file]) as f: text = f.read()
    # pseudo-XML: there is no root element, which stops it from being well formed,
    # so force one in...
    soup = BeautifulSoup(f"<root>{text}</root>", "xml")
    # simple case of each review is a row and each child element is a column
    dfs[file] = pd.DataFrame([{c.name:c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
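The loop above builds one DataFrame per file but does not yet add the label the question asks for. A minimal sketch of that last step, assuming the dfs dictionary from above; labelling negative reviews 0 is my own assumption, since the question only specifies 1 for positives:

# a minimal sketch: tag each per-file frame and stack them into one DataFrame;
# the 0 for negative reviews is an assumption - the question only asks for 1
labelled = []
for name, df in dfs.items():
    if name.endswith("positive.review"):
        labelled.append(df.assign(label=1))
    elif name.endswith("negative.review"):
        labelled.append(df.assign(label=0))  # unlabeled.review files are skipped

reviews = pd.concat(labelled, ignore_index=True)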
