Python: converting text in an HTML file to CSV without unique identifier tags

<h5>Practice Locations</h5> <p>Springfield, 1234<br/> 08 1234 5678</p> <p>Shelbyville, 1234<br/>08 1234 5678</p> <h5>Gender:</h5> <p>Male<br/></p> <h5>Languages spoken (other than English):</h5> <p>Spanish<br/></p> <p>Italian<br/></p> <h5>Problem areas treated:</h5> <p>Anxiety disorders<br/>Mood disorders<br/>Sexual disorders<br/></p> <h5>Populations treated:</h5> <p>Adult<br/>Young adult<br/></p> <h5>Subspecialty areas:</h5> <p>Cancer patients<br/>Gender issues<br/>Pain management<br/>Specialist psychotherapist<br/></p> <h5>Treatments and services offered:</h5> <p>Does not prescribe psychotropics<br/>Psychotherapy – cognitive behavioural therapy (CBT)<br/>Psychotherapy – hypnotherapy<br/>Psychotherapy – interpersonal<br/>Psychotherapy – marital therapy<br/></p> <h5>Practice details:</h5> <p>Can bulk bill selected patients<br/></p> <p> </p>

1 Answer

User

#1 · Posted 2024-06-01 01:35:14

Use the h5 tags as headers:

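import re
import itertools
from bs4 import BeautifulSoup as soup

# `content` is assumed to hold the HTML from the question as a string.
page = soup(content, 'html.parser')
# Text of every <h5>; these act as the section headers.
headers = [i.text for i in page.find_all('h5')]
# Every <h5> and <p> in document order, as [text, tag] pairs.
full_data = [[i.text, i] for i in page.find_all(re.compile('^(h5|p)$'))]
# Alternating runs of header entries and paragraph entries.
new_data = [[a, list(b)] for a, b in itertools.groupby(full_data, key=lambda x: x[0] in headers)]
# Pair each header run with the paragraph run that follows it.
grouped = [new_data[i] + new_data[i+1] for i in range(0, len(new_data), 2)]
# Key each paragraph by its text; the value holds its <br/>-separated items after the first.
final_data = {c: {i: str(h)[3:-4].split('<br/>')[1:] for i, h in results} for [_, [[c, _]], _, results] in grouped}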

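Output (the [1:] slice drops the first <br/>-separated item from each value, so that item appears only as part of the key):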

{'Practice Locations': {'Springfield, 1234 08 1234 5678': [' 08 1234 5678'], 'Shelbyville, 123408 1234 5678': ['08 1234 5678']}, 'Gender:': {'Male': ['']}, 'Languages spoken (other than English):': {'Spanish': [''], 'Italian': ['']}, 'Problem areas treated:': {'Anxiety disordersMood disordersSexual disorders': ['Mood disorders', 'Sexual disorders', '']}, 'Populations treated:': {'AdultYoung adult': ['Young adult', '']}, 'Subspecialty areas:': {'Cancer patientsGender issuesPain managementSpecialist psychotherapist': ['Gender issues', 'Pain management', 'Specialist psychotherapist', '']}, 'Treatments and services offered:': {'Does not prescribe psychotropicsPsychotherapy – cognitive behavioural therapy (CBT)Psychotherapy – hypnotherapyPsychotherapy – interpersonalPsychotherapy – marital therapy': ['Psychotherapy – cognitive behavioural therapy (CBT)', 'Psychotherapy – hypnotherapy', 'Psychotherapy – interpersonal', 'Psychotherapy – marital therapy', '']}, 'Practice details:': {'Can bulk bill selected patients': [''], ' ': []}}
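
The dictionary above still has to be flattened before it can be written as CSV, which is what the question asks for. The following is a minimal sketch of one way to do that, not part of the original answer: it walks the h5 and p tags in document order, collects the text fragments under each heading, and writes one column per heading with a single data row. The names parse_sections and sections_to_csv, the "; " separator used to join multiple items in a cell, and the stripping of the trailing colon from headings are all illustrative assumptions.

```python
import csv
import io

from bs4 import BeautifulSoup


def parse_sections(html):
    """Map each <h5> heading to the text items found in the <p> tags after it."""
    page = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None
    for tag in page.find_all(["h5", "p"]):
        if tag.name == "h5":
            # Assumption: a trailing colon is not wanted in the CSV header.
            current = tag.get_text(strip=True).rstrip(":")
            sections[current] = []
        elif current is not None:
            # stripped_strings yields each text fragment between <br/> tags,
            # trimmed, and skips whitespace-only fragments such as "<p> </p>".
            sections[current].extend(tag.stripped_strings)
    return sections


def sections_to_csv(sections, delimiter="; "):
    """Return CSV text with one column per heading and a single data row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(list(sections.keys()))
    writer.writerow([delimiter.join(items) for items in sections.values()])
    return buffer.getvalue()


# Small sample taken from the HTML in the question.
sample = (
    "<h5>Gender:</h5> <p>Male<br/></p> "
    "<h5>Languages spoken (other than English):</h5> "
    "<p>Spanish<br/></p> <p>Italian<br/></p>"
)
print(sections_to_csv(parse_sections(sample)))
```

For several HTML files, the same header row could be written once and one data row appended per file, provided every file uses the same set of headings; files with differing headings would need csv.DictWriter or a similar approach to align the columns.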
