Python: converting text from an HTML file to CSV when the tags have no unique identifiers

Posted 2024-06-01 01:35:14


I used beautifulsoup4 to scrape the information I want from a web page that lists the details of a psychiatrist's practice, and managed to get back this key part:

<h5>Practice Locations</h5>
    <p>Springfield, 1234<br/> 08 1234 5678</p>
    <p>Shelbyville, 1234<br/>08 1234 5678</p>
<h5>Gender:</h5>
    <p>Male<br/></p>
<h5>Languages spoken (other than English):</h5>
    <p>Spanish<br/></p>
    <p>Italian<br/></p>
<h5>Problem areas treated:</h5>
    <p>Anxiety disorders<br/>Mood disorders<br/>Sexual disorders<br/></p>
<h5>Populations treated:</h5>
<p>Adult<br/>Young adult<br/></p>
<h5>Subspecialty areas:</h5>
    <p>Cancer patients<br/>Gender issues<br/>Pain management<br/>Specialist psychotherapist<br/></p>
<h5>Treatments and services offered:</h5>
    <p>Does not prescribe psychotropics<br/>Psychotherapy – cognitive behavioural therapy (CBT)<br/>Psychotherapy – hypnotherapy<br/>Psychotherapy – interpersonal<br/>Psychotherapy – marital therapy<br/></p>
<h5>Practice details:</h5>
    <p>Can bulk bill selected patients<br/></p>
<p> </p>

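I want to put the information under each heading into its own column of a .csv file, but I don't know how, because the headings have no unique identifiers. I know I have to use the headings somehow to separate the columns, but I'm completely new to Python and don't know how to go about it.

This would be easy to do by hand, but I want to collect this information from many pages that are formatted the same way. To complicate matters, some pages lack the information for certain headings (for example, they don't list populations treated or subspecialties), so I have to check whether each heading exists before trying to collect its information.

Any guidance would be greatly appreciated.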


1 answer
User
#1 · Posted 2024-06-01 01:35:14

Use the h5 tags as the headers:

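import re
import itertools
from bs4 import BeautifulSoup as soup
# `content` is assumed to hold the scraped HTML from the question, as a string
parsed = soup(content, 'html.parser')
# Text of every <h5>; these act as the section headers
headers = [i.text for i in parsed.find_all('h5')]
# Every <h5> and <p> in document order as [text, tag] pairs (anchored so only those two tag names match)
full_data = [[i.text, i] for i in parsed.find_all(re.compile(r'^(h5|p)$'))]
# Split into alternating runs: a header row, then the <p> rows that follow it
new_data = [[a, list(b)] for a, b in itertools.groupby(full_data, key=lambda x: x[0] in headers)]
# Pair each header run with the run of <p> rows after it
grouped = [new_data[i]+new_data[i+1] for i in range(0, len(new_data), 2)]
# Per header: paragraph text -> the items that come after the first <br/>
final_data = {c:{i:str(h)[3:-4].split('<br/>')[1:] for i, h in results} for [_, [[c, _]], _, results] in grouped}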

Output:

{'Practice Locations': {'Springfield, 1234 08 1234 5678': [' 08 1234 5678'], 'Shelbyville, 123408 1234 5678': ['08 1234 5678']}, 'Gender:': {'Male': ['']}, 'Languages spoken (other than English):': {'Spanish': [''], 'Italian': ['']}, 'Problem areas treated:': {'Anxiety disordersMood disordersSexual disorders': ['Mood disorders', 'Sexual disorders', '']}, 'Populations treated:': {'AdultYoung adult': ['Young adult', '']}, 'Subspecialty areas:': {'Cancer patientsGender issuesPain managementSpecialist psychotherapist': ['Gender issues', 'Pain management', 'Specialist psychotherapist', '']}, 'Treatments and services offered:': {'Does not prescribe psychotropicsPsychotherapy – cognitive behavioural therapy (CBT)Psychotherapy – hypnotherapyPsychotherapy – interpersonalPsychotherapy – marital therapy': ['Psychotherapy – cognitive behavioural therapy (CBT)', 'Psychotherapy – hypnotherapy', 'Psychotherapy – interpersonal', 'Psychotherapy – marital therapy', '']}, 'Practice details:': {'Can bulk bill selected patients': [''], ' ': []}}
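
In the output above, each inner list holds only the items after the first `<br/>`, the trailing `<p> </p>` shows up as a key, and the pairing step relies on every header being followed by at least one `<p>`. For the CSV the question asks about, a flatter structure may be easier to work with. The following is a minimal sketch rather than part of the original answer. It assumes the `<h5>` and `<p>` tags are siblings under the same parent, as in the sample. The names `extract_sections`, `write_rows`, `sample` and `columns` are hypothetical, and the column list is something you would choose yourself.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_sections(html):
    """Map each <h5> heading to the text items in the <p> tags that follow it."""
    parsed = BeautifulSoup(html, "html.parser")
    sections = {}
    for header in parsed.find_all("h5"):
        items = []
        for sibling in header.find_next_siblings():
            if sibling.name == "h5":  # next heading reached, so this section ends
                break
            if sibling.name == "p":
                # stripped_strings yields each text fragment between <br/> tags,
                # trimmed, and skips whitespace-only fragments
                items.extend(sibling.stripped_strings)
        # Drop the trailing colon so "Gender:" becomes the column name "Gender"
        sections[header.get_text(strip=True).rstrip(":")] = items
    return sections


def write_rows(pages, columns, out):
    """Write one CSV row per page; a heading missing from a page gives an empty cell."""
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for html in pages:
        sections = extract_sections(html)
        # sections.get(col, []) covers pages that lack a heading entirely
        writer.writerow({col: "; ".join(sections.get(col, [])) for col in columns})


# Usage: this sample page has no "Populations treated" heading
sample = (
    "<h5>Gender:</h5><p>Male<br/></p>"
    "<h5>Languages spoken (other than English):</h5>"
    "<p>Spanish<br/></p><p>Italian<br/></p>"
)
columns = ["Gender", "Languages spoken (other than English)", "Populations treated"]
buffer = io.StringIO()
write_rows([sample], columns, buffer)
print(buffer.getvalue())
```

Because the columns are fixed up front and looked up with `dict.get`, there is no need to test for each heading before reading it. Note that this sketch flattens every `<br/>`-separated fragment into one list, so under "Practice Locations" a suburb and its phone number become separate items. To write to disk instead of a `StringIO`, pass a file opened with `open(path, "w", newline="")`.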
