Extracting useful data between strings from a file using Python

Posted 2024-04-23 05:28:59


I have the following raw data in an HTML source file:

{$deletedFields:[courses,projects,description,degreeName,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),school:urn:li:fs_miniSchool:11709,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,schoolName:Charles University in Prague,fieldOfStudy:Economics, Politics,schoolUrn:urn:li:fs_miniSchool:11709,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[courses,projects,description,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),school:urn:li:fs_miniSchool:17888,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,degreeName:BA,schoolName:Occidental College,fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[],profileId:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,elements:[urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717)],paging:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView,paging,$type:com.linkedin.voyager.identity.profile.EducationView,$id:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView},
{$deletedFields:[],start:501,end:1000,$type:com.linkedin.voyager.identity.profile.EmployeeCountRange,$id:urn:li:fs_position:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,323432440),company,employeeCountRange}



{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,startDate},
{$deletedFields:[month,day],year:2004,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,startDate},
{$deletedFields:[month,day],year:2008,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,endDate},
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,endDate},

What I need is to extract some data from it.

import re

# page_html holds the HTML source shown above
schoolname = re.findall(r',schoolName:(.*?),', page_html)
fieldofstudy = re.findall(r'fieldOfStudy:(.*?),s', page_html)
degreename = re.findall(r'degreeName:(.*?),', page_html)

Desired output:

schoolName: Charles University in Prague
fieldOfStudy: Economics, Politics
startDate: 2007
endDate: 2007

schoolName: Occidental College
fieldOfStudy: Economics
degreeName: BA
startDate: 2004
endDate: 2008


1 answer
Anonymous user
Answer #1 · posted 2024-04-23 05:28:59

Question: What I need is to extract some data out of it.

Define a data container, class School:

class School(object):
    """Holds the fields parsed from one record (one '{...}' line)."""

    def __init__(self, raw_data):
        # raw_data is the record body split on ','. A value that itself
        # contains a comma therefore arrives as several tokens.
        key = None
        year = '?'
        for kv in raw_data:
            i = kv.find(':')
            if i >= 0:
                # 'key:value' token
                key = kv[0:i]
                value = kv[i + 1:]
                if key in ['schoolName', 'fieldOfStudy', 'startDate', 'endDate', 'degreeName']:
                    setattr(self, key, value)
                elif key == 'year':
                    year = value
            else:
                # A token without ':' continues the previous key's value
                if key in ['entityUrn', '$id']:
                    # e.g. '75863717)': the tail of the URN is the entity id
                    if kv[:-1].isdigit():
                        self.entity = kv[:-1]
                elif key == 'fieldOfStudy':
                    self.fieldOfStudy += ', ' + kv
                elif kv in ['startDate', 'endDate']:
                    # Date records end in ',startDate' or ',endDate'
                    setattr(self, kv, year)
                key = ''

        # Not every record carries a degree
        if not hasattr(self, 'degreeName'):
            self.degreeName = 'unknown'

    def __repr__(self):
        return "entity:\t\t{s.entity:>28}\n" \
               "schoolName:\t{s.schoolName:>28}\n" \
               "fieldOfStudy:{s.fieldOfStudy:>27}\n" \
               "degreeName:\t{s.degreeName:>28}\n" \
               "startDate:\t{s.startDate:>28}\n" \
               "endDate:\t{s.endDate:>28}\n".format(s=self)

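The constructor expects the body of one record already split on commas. As a rough sketch of that preprocessing step, the snippet below applies the same regular expression that the file-reading loop uses to a single date record copied from the question's data. The names `RECORD`, `line` and `tokens` are only illustrative and are not part of the answer's code.

```python
import re

# Same pattern as in the file-reading loop: {<name>:[<deleted fields>],<body>}
RECORD = re.compile(r'\{(.*?)\:\[(.*?)\],(.*)\}')

# One date record copied from the question's data
line = ('{$deletedFields:[month,day],year:2007,'
        '$type:com.linkedin.common.Date,'
        '$id:urn:li:fs_education:'
        '(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),'
        'timePeriod,startDate},')

match = RECORD.search(line)
print(match.group(2))            # month,day

# The third group is the record body; splitting on ',' gives the tokens
tokens = match.group(3).split(',')
for token in tokens:
    print(repr(token))

# A blank line does not match at all, so search() returns None
print(RECORD.search('\n'))       # None
```

The URN is cut at its inner comma, so the entity id arrives as the separate token `'75863717)'`. That is why the class treats a token without a colon as a continuation of the previous key.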
Read the file line by line:

import re

degreeUrn = {}
with open('<path to file>') as fh:
    for line in fh:
        # Each record looks like {$deletedFields:[...],<body>}
        match = re.findall(r'\{(.*?)\:\[(.*?)\],(.*)\}', line)
        if not match:
            # Blank line, or a line that is not a record
            continue
        m2 = match[0][2].split(',')
        school = School(m2)
        if not hasattr(school, 'entity'):
            continue
        # Date records are assumed to follow their education record,
        # as they do in the data above
        if hasattr(school, 'startDate'):
            degreeUrn[school.entity].startDate = school.startDate
        elif hasattr(school, 'endDate'):
            degreeUrn[school.entity].endDate = school.endDate
        elif hasattr(school, 'schoolName'):
            # Education record: keep it, keyed by entity id
            degreeUrn[school.entity] = school

for entity in degreeUrn:
    print(degreeUrn[entity])

Output:

entity:                         75863717
schoolName: Charles University in Prague
fieldOfStudy:       Economics,  Politics
degreeName:                      unknown
startDate:                          2007
endDate:                            2007

entity:                         26812055
schoolName:           Occidental College
fieldOfStudy:                  Economics
degreeName:                           BA
startDate:                          2004
endDate:                            2008

Tested with Python 3.4.2
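
For comparison, here is a more compact sketch that is not the approach above: it keeps plain dicts keyed by entity id and pulls each field out with its own regular expression. It assumes that every value is followed by another `,key:` pair and that every record sits on one line, which holds for the sample data but may not hold in general. The helpers `field` and `parse` and the constant `SAMPLE` are hypothetical names introduced for this illustration.

```python
import re

# Three records copied from the question's data (one education, two dates)
SAMPLE = """\
{$deletedFields:[courses,projects,description,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),school:urn:li:fs_miniSchool:17888,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,degreeName:BA,schoolName:Occidental College,fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[month,day],year:2004,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,startDate},
{$deletedFields:[month,day],year:2008,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,endDate},
"""


def field(name, record):
    """Return the value of `name`, read up to the next ',key:' pair."""
    m = re.search(re.escape(name) + r':(.*?),[$\w]+:', record)
    return m.group(1) if m else None


def parse(text):
    """Group education and date records by entity id."""
    schools = {}
    for record in text.splitlines():
        id_match = re.search(r'fs_education:\([^,()]+,(\d+)\)', record)
        if not id_match:
            continue  # not an education-related record
        entry = schools.setdefault(id_match.group(1), {})
        year = re.search(r'year:(\d+)', record)
        kind = re.search(r',(startDate|endDate)\}', record)
        if year and kind:
            # Date record: store the year under startDate or endDate
            entry[kind.group(1)] = year.group(1)
            continue
        for name in ('schoolName', 'fieldOfStudy', 'degreeName'):
            value = field(name, record)
            if value is not None:
                entry[name] = value
    return schools


for entry in parse(SAMPLE).values():
    for name in ('schoolName', 'fieldOfStudy', 'degreeName',
                 'startDate', 'endDate'):
        if name in entry:
            print('{}: {}'.format(name, entry[name]))
```

Unlike the class-based version, this sketch does not depend on date records following their education record, because `setdefault` creates the entry whichever record is seen first.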
