Getting a date from JATS XML using BeautifulSoup

Posted 2024-04-23 22:59:15


How can I extract the electronic publication (epub) date from JATS XML using BeautifulSoup?

<pub-date pub-type="epub">
<day>12</day>
<month>09</month>
<year>2011</year>
</pub-date>

→ September 12, 2011

Any other pub-date element, one whose pub-type is not "epub", should be ignored.


1 answer

Answer 1 · posted 2024-04-23 22:59:15

In your example, pub-type is an attribute of the pub-date element, and the value of that attribute is "epub". An XML parser such as lxml handles this well.

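Here are two functions that use lxml.etree. They locate the candidate date fields with XPath and parse a field only when its pub-type attribute is "epub". I based them on the PLOS flavor of JATS XML, so hopefully they apply here too.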

import datetime
import lxml.etree as et

def parse_article_date(date_element, date_format='%Y %m %d'):
    """
    For an article date element, convert XML fields to a datetime object
    :param date_element: pub-date element whose day, month and year children are read
    :param date_format: string format used to convert to datetime object
    :return: datetime object based on XML date fields
    """
    day = ''
    month = ''
    year = ''
    for item in date_element:  # iterate over child elements; getchildren() is deprecated
        if item.tag == 'day':
            day = item.text
        if item.tag == 'month':
            month = item.text
        if item.tag == 'year':
            year = item.text
    date = (year, month, day)
    string_date = ' '.join(date)
    date = datetime.datetime.strptime(string_date, date_format)

    return date

def get_article_pubdate(article_file, tag_path_elements=None, string_=False):
    """
    For a local article file, get its date of publication
    :param article_file: the xml file for a single article
    :param tag_path_elements: tuple of path components, joined with '/' into the XPath of the pub-date elements
    :param string_: defaults to False. If True, returns a date string instead of datetime object
    :return: dict of date type mapped to datetime object (or date string) for that article
    """
    pub_date = {}
    if tag_path_elements is None:
        tag_path_elements = ("/",
                             "article",
                             "front",
                             "article-meta",
                             "pub-date")

    article_tree = et.parse(article_file)
    article_root = article_tree.getroot()
    tag_location = '/'.join(tag_path_elements)
    pub_date_fields = article_root.xpath(tag_location)

    for element in pub_date_fields:
        pub_type = element.get('pub-type')
        if pub_type == 'epub':
            date = parse_article_date(element)
            pub_date[pub_type] = date

    if string_:
        for key, value in pub_date.items():
            if value:
                pub_date[key] = value.strftime('%Y-%m-%d')  # you can set this to any date format

    return pub_date
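
If lxml is not available, the same "epub only" filter can be sketched with the standard library's xml.etree.ElementTree, which supports attribute predicates in its limited XPath subset. This is only a minimal sketch, not part of the functions above: the sample document, the epub_date name, and the choice to fall back to 1 when month or day is missing are all assumptions made for illustration.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample: the epub date from the question plus a second
# pub-date of another type, which should be skipped.
SAMPLE = """<article>
  <front>
    <article-meta>
      <pub-date pub-type="epub">
        <day>12</day>
        <month>09</month>
        <year>2011</year>
      </pub-date>
      <pub-date pub-type="collection">
        <year>2011</year>
      </pub-date>
    </article-meta>
  </front>
</article>"""


def epub_date(xml_text):
    """Return the epub publication date as a datetime.date, or None if absent."""
    root = ET.fromstring(xml_text)
    # The [@pub-type='epub'] predicate keeps only the electronic publication date.
    element = root.find(".//pub-date[@pub-type='epub']")
    if element is None:
        return None
    # findtext returns the child's text, or the default when the child is missing.
    # Assumption: a missing month or day falls back to 1; a missing year raises.
    year = int(element.findtext("year"))
    month = int(element.findtext("month", default="1"))
    day = int(element.findtext("day", default="1"))
    return datetime.date(year, month, day)


print(epub_date(SAMPLE))  # 2011-09-12
```

Putting the predicate in the path avoids looping over every pub-date and checking the attribute by hand. To read from a file instead of a string, ET.parse(path).getroot() can replace ET.fromstring.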
