Get a reference list and citation counts from text in Python

Posted 2024-04-25 20:20:47


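In my Python assignment I have to read a PDF document and get all of its references together with their counts (how many times each one is cited in the paper). This is the PDF as example. It has 18 references; say reference 1 is cited 3 times in the paper and reference 2 is cited once, so this is what I want: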

Ref#  Count  Reference
 1     3     Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
 2     1     Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Workshop on Robust Parsing, pages 54-69, Prague
 ...

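I have already produced the Ref# and Reference columns, and with the following regular expression I managed to get the lines of text that contain citations: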

regex = re.compile(r'[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}; [A-Za-z \u0000-\u007F,;]*\)|[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4},[A-Za-z0-9\u0000-\u007F ]*\)|[A-Z]{1}[a-z\u0000-\u007F ]+ [a-z]{2} [a-z]{2}. \([0-9]{4}\)')

So when I iterate over the list of strings (the text split into sentences) and search with the regex above using the following code:

for sentence in lstString:
    refLine = regex.findall(sentence)
    if refLine:  # non-empty list: at least one citation matched
        print(refLine)

I get output like this:

    (Karls- son et al., 1995)
    Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson
(1990)
    (Tapanainen, 1996)
    (Tapanainen, 1996) is dif- ferent from the former (Karlsson et al., 1995)
    Hurskainen (1996)
    In essence, the same formalism is used in the syn- tactic analysis in J~rvinen (1994) and     Anttila (1995)
    Our notation follows the classical model of depen- dency theory (Heringer, 1993) introduced by Lucien Tesni~re (1959) and later
advocated by Igor Mel'~uk (1987)
    Hudson (1991)
    (Hays, 1964)
    (McCord, 1990; Sleator and Tem- perley, 1991; Eisner, 1996)
    (Hudson, 1991)
    (J~irvinen, 1994)
    The CG-2 program (Tapanainen, 1996) runs a mod- ified disambiguation grammar of Voutilainen (1995)
    (J~rvinen, 1994; Tapanainen and J/~rvinen, 1994)
    (Eisner, 1996)
    Dekang Lin (1996)
    Acknowledgments We are using Atro Voutilainen's (1995)

It returns all the strings that contain citations, but I have the following problems:

  1. It does not capture citations like Karlsson et al. (1995).
  2. Some of the strings contain two citations.
  3. How can I update the count for each reference in the reference list?

I tried this code to get the count for each reference, but it always returns the whole list:

matching = [s for s in lstRef if any(xs in s for xs in refLine)]

Any kind of help will be highly appreciated.


1 Answer

#1 · Posted 2024-04-25 20:20:47

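I wonder what would happen if you took the names (and years) from the References at the end of the document and used them to search for citations in the document.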

In your previous question you already got code that extracts the References at the end of the document.

With the regex '((.*)\. (\d{4})\.)' I can get the names as one string and the year as another string (and, in the outer group, both together as one string):

    authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
    text, authors, year = authors_and_year.groups()

   text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
   year: 1996

With the next regex ',[ ]*and |,[ ]*| and ' I can split the string of names into a list of names:

    names = re.split(r',[ ]*and |,[ ]*| and ', authors)

With a plain split(" ") I can get the last name (surname), which is more useful than the full name:

    names = [(name, name.split(' ')[-1]) for name in names]

names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
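The name-extraction results further down include an entry ('editors', 'editors'), because the role word after the last comma is treated as another author. A minimal sketch of one way to filter such words out, assuming the same split regex as above; the helper name split_authors and the ROLE_WORDS set are hypothetical, not part of the original answer:

```python
import re

# Role words that may trail an author list. This set is an assumption;
# extend it for the reference styles you actually meet.
ROLE_WORDS = {"editor", "editors", "ed", "eds"}

def split_authors(authors):
    """Split an author string into (full_name, surname) pairs, skipping role words."""
    parts = re.split(r',[ ]*and |,[ ]*| and ', authors)
    pairs = []
    for part in parts:
        part = part.strip()
        # Ignore empty pieces and trailing role markers such as "editors".
        if not part or part.lower().rstrip('.') in ROLE_WORDS:
            continue
        pairs.append((part, part.split(' ')[-1]))
    return pairs

print(split_authors('Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors'))
```

With this filter the edited volume yields four authors instead of five, so the "more than one author" checks are no longer affected by the role word.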

Now I can use these names (or rather the surnames) and the year to generate strings like "surname (year)" and "surname, year", and then search for them in the document.

If there are several surnames, I can take the first one and generate "surname et al. (year)", and so on.
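The string generation described above could be sketched as follows. The function name citation_variants is hypothetical, and the choice of which forms to emit for one, two, or more authors is an assumption based on common author-year citation styles, so adjust it to the paper at hand:

```python
def citation_variants(surnames, year):
    """Return the in-text citation strings to search for, for one reference."""
    first = surnames[0]
    if len(surnames) == 1:
        heads = [first]
    elif len(surnames) == 2:
        # Two authors are usually cited as "A and B"; "A et al." is kept as a fallback.
        heads = [f"{surnames[0]} and {surnames[1]}", f"{first} et al."]
    else:
        heads = [f"{first} et al."]

    variants = []
    for head in heads:
        variants.append(f"{head} ({year})")  # narrative form: Surname (1995)
        variants.append(f"{head}, {year}")   # parenthetical form: (Surname, 1995)
    return variants

print(citation_variants(['Karlsson', 'Voutilainen', 'Anttila'], '1995'))
print(citation_variants(['Sleator', 'Temperley'], '1991'))
```

Leaving the bare first surname out for multi-author works is deliberate in this sketch: it keeps "Karlsson (1990)" from being credited to the 1995 edited volume as well.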

With these strings and the standard string method text.count(generated_string) I can count the occurrences.
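As a rough illustration of the counting step, the sketch below sums str.count over the generated strings for one reference. The function name count_citations and the sample sentence are made up for the example:

```python
def count_citations(doc, variants):
    """Count each citation string in doc; return (per-variant counts, total)."""
    per_variant = {variant: doc.count(variant) for variant in variants}
    return per_variant, sum(per_variant.values())

# A made-up snippet of body text, only for demonstration.
sample_doc = (
    "The CG-2 program (Tapanainen, 1996) runs a grammar of Voutilainen (1995). "
    "(Tapanainen, 1996) is different from the former (Karlsson et al., 1995)."
)

counts, total = count_citations(sample_doc, ["Tapanainen (1996)", "Tapanainen, 1996"])
print(counts, total)
```

Note that str.count matches plain substrings, so a short surname such as Lin would also be counted inside a longer name that ends with the same letters; a regex with a word-boundary check may be safer for such cases.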

This is all I have for now, and it is still not ideal.

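You can find all the citations in the document by hand and use them to test the code. You will then see which ones are counted correctly and which ones need more work.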

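For example, in the text We are using Atro Voutilainen's (1995) the citation carries a possessive 's. Perhaps the document should be cleaned first, as is done in NLP (Natural Language Processing), for example with nltk.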
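Before reaching for a full NLP toolkit, a few regular expressions may already handle the cases visible in the output above: words split by a line-break hyphen (Karls- son), a year pushed onto the next line, and the possessive 's. This is only a sketch under those assumptions; clean_text is a hypothetical helper and the rules are deliberately simple:

```python
import re

def clean_text(text):
    """Normalize extracted PDF text before searching it for citations."""
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r'\s+', ' ', text)
    # Join words split by a line-break hyphen, e.g. "Karls- son" -> "Karlsson".
    text = re.sub(r'(?<=[a-z])- (?=[a-z])', '', text)
    # Drop a possessive 's that sits directly before a "(year)".
    text = re.sub(r"'s(?= \(\d{4}\))", '', text)
    return text.strip()

print(clean_text("(Karls- son et al., 1995)"))
print(clean_text("We are using Atro Voutilainen's (1995)"))
```

The de-hyphenation rule also joins genuine compounds that happen to be followed by a space after the hyphen, so it is worth checking its effect on a sample of the text.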

Some non-ASCII characters cause problems: the name Järvinen is extracted as J~rvinen in one place and as J/irvinen in another.
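One possible workaround for such garbled names is approximate matching instead of exact string comparison. The sketch below uses difflib.SequenceMatcher from the standard library; the helper name similar and the 0.8 threshold are assumptions and would need tuning against the real document:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Return True if two surnames are close enough to be treated as the same name."""
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(similar("J~rvinen", "J/irvinen"))   # two garbled extractions of the same surname
print(similar("J~rvinen", "Tapanainen"))  # different surnames
```

A check like this could be applied to the surname part of each candidate citation before comparing years, at the cost of occasionally merging two genuinely different but similar-looking names.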

import re
import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

#  - functions  -

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645
    
    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    prev_x = 0
    prev_y = 0
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
            
        if operator == b_("Tm"):
        
            if distance is True: 
                text += '\n'
                
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'
                
                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements
                    
                prev_x = x
                prev_y = y
            
    return text
        
#  - main  -
        
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')

# only references as text
references = text[pos+len('references '):]

# doc without references
doc = text[:pos]

# references as list
references = references.split('\n')

# remove empty lines and lines which have 2 chars (ie. page number)
references = [item.strip() for item in references if len(item.strip()) > 2]

print('\n - names  -\n')

data = []

for number, line in enumerate(references, 1):
    line = line.strip()
    if line:  # skip empty line
    
        authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
        if authors_and_year is None:  # skip lines without "Authors. Year." (e.g. wrapped lines)
            continue
        text, authors, year = authors_and_year.groups()
        #print(text, '|', authors, '|', year)
        
        names = re.split(r',[ ]*and |,[ ]*| and ', authors)
        #print(names)
        
        # [(name, last_name), ...]
        names = [(name, name.split(' ')[-1]) for name in names]
        #print(names)
        
        #print(' line:', line)
        print('   text:', text)
        print('authors:', authors)
        print('   year:', year)
        print('  names:', names)
        print(' -')
        data.append((authors, names, year))

print('\n - counting  -\n')

# https://guides.lib.monash.edu/citing-referencing/APA-In-text
# Tapanainen and J/~rvine, 

for authors, names, year in data:
    print('authors:', authors)
    print('   year:', year)
    print('  names:', names)
    print(' et al.:', len(names) > 1)
    print('   and :', len(names) == 2)
    print(' -')
    first_lastname = names[0][-1]
    print(doc.count(first_lastname), first_lastname)
    print(doc.count(first_lastname + ', ' + year), first_lastname + ', ' + year)
    print(doc.count(first_lastname + ' (' + year + ')'), first_lastname + ' (' + year + ')')
    
    if len(names) > 1:
        first_lastname_et_al = first_lastname + ' et al.'
        print(doc.count(first_lastname_et_al), first_lastname_et_al)
        print(doc.count(first_lastname_et_al + ', ' + year), first_lastname_et_al + ', ' + year)
        print(doc.count(first_lastname_et_al + ' (' + year + ')'), first_lastname_et_al + ' (' + year + ')')

    if len(names) == 2:
        all_lastnames = ' and '.join(item[-1] for item in names)
        print(doc.count(all_lastnames), all_lastnames)
        print(doc.count(all_lastnames + ', ' + year), all_lastnames + ', ' + year)
        print(doc.count(all_lastnames + ' (' + year + ')'), all_lastnames + ' (' + year + ')')

    print('     ')

Result of the name extraction:

 - names  -

   text: Arto Anttila. 1995.
authors: Arto Anttila
   year: 1995
  names: [('Arto Anttila', 'Anttila')]
 -
   text: Dekang Lin. 1996.
authors: Dekang Lin
   year: 1996
  names: [('Dekang Lin', 'Lin')]
 -
   text: Jason M. Eisner. 1996.
authors: Jason M. Eisner
   year: 1996
  names: [('Jason M. Eisner', 'Eisner')]
 -
   text: David G. Hays. 1964.
authors: David G. Hays
   year: 1964
  names: [('David G. Hays', 'Hays')]
 -
   text: Hans Jiirgen Heringer. 1993.
authors: Hans Jiirgen Heringer
   year: 1993
  names: [('Hans Jiirgen Heringer', 'Heringer')]
 -
   text: Richard Hudson. 1991.
authors: Richard Hudson
   year: 1991
  names: [('Richard Hudson', 'Hudson')]
 -
   text: Arvi Hurskainen. 1996.
authors: Arvi Hurskainen
   year: 1996
  names: [('Arvi Hurskainen', 'Hurskainen')]
 -
   text: Time J~rvinen. 1994.
authors: Time J~rvinen
   year: 1994
  names: [('Time J~rvinen', 'J~rvinen')]
 -
   text: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995.
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
   year: 1995
  names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
 -
   text: Fred Karlsson. 1990.
authors: Fred Karlsson
   year: 1990
  names: [('Fred Karlsson', 'Karlsson')]
 -
   text: Michael McCord. 1990.
authors: Michael McCord
   year: 1990
  names: [('Michael McCord', 'McCord')]
 -
   text: Igor A. Mel'~uk. 1987.
authors: Igor A. Mel'~uk
   year: 1987
  names: [("Igor A. Mel'~uk", "Mel'~uk")]
 -
   text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
   year: 1996
  names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
 -
   text: Daniel Sleator and Davy Temperley. 1991.
authors: Daniel Sleator and Davy Temperley
   year: 1991
  names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
 -
   text: Pasi Tapanainen and Time J/irvinen. 1994.
authors: Pasi Tapanainen and Time J/irvinen
   year: 1994
  names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
 -
   text: Pasi Tapanainen. 1996.
authors: Pasi Tapanainen
   year: 1996
  names: [('Pasi Tapanainen', 'Tapanainen')]
 -
   text: Lucien TesniSre. 1959.
authors: Lucien TesniSre
   year: 1959
  names: [('Lucien TesniSre', 'TesniSre')]
 -
   text: Atro Voutilainen. 1995.
authors: Atro Voutilainen
   year: 1995
  names: [('Atro Voutilainen', 'Voutilainen')]
 -

Result of the counting:

 - counting  -

authors: Arto Anttila
   year: 1995
  names: [('Arto Anttila', 'Anttila')]
 et al.: False
   and : False
 -
1 Anttila
0 Anttila, 1995
1 Anttila (1995)
     
authors: Dekang Lin
   year: 1996
  names: [('Dekang Lin', 'Lin')]
 et al.: False
   and : False
 -
4 Lin
0 Lin, 1996
1 Lin (1996)
     
authors: Jason M. Eisner
   year: 1996
  names: [('Jason M. Eisner', 'Eisner')]
 et al.: False
   and : False
 -
2 Eisner
2 Eisner, 1996
0 Eisner (1996)
     
authors: David G. Hays
   year: 1964
  names: [('David G. Hays', 'Hays')]
 et al.: False
   and : False
 -
1 Hays
1 Hays, 1964
0 Hays (1964)
     
authors: Hans Jiirgen Heringer
   year: 1993
  names: [('Hans Jiirgen Heringer', 'Heringer')]
 et al.: False
   and : False
 -
1 Heringer
1 Heringer, 1993
0 Heringer (1993)
     
authors: Richard Hudson
   year: 1991
  names: [('Richard Hudson', 'Hudson')]
 et al.: False
   and : False
 -
2 Hudson
1 Hudson, 1991
1 Hudson (1991)
     
authors: Arvi Hurskainen
   year: 1996
  names: [('Arvi Hurskainen', 'Hurskainen')]
 et al.: False
   and : False
 -
1 Hurskainen
0 Hurskainen, 1996
1 Hurskainen (1996)
     
authors: Time J~rvinen
   year: 1994
  names: [('Time J~rvinen', 'J~rvinen')]
 et al.: False
   and : False
 -
2 J~rvinen
1 J~rvinen, 1994
1 J~rvinen (1994)
     
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
   year: 1995
  names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
 et al.: True
   and : False
 -
3 Karlsson
0 Karlsson, 1995
0 Karlsson (1995)
2 Karlsson et al.
1 Karlsson et al., 1995
1 Karlsson et al. (1995)
     
authors: Fred Karlsson
   year: 1990
  names: [('Fred Karlsson', 'Karlsson')]
 et al.: False
   and : False
 -
3 Karlsson
0 Karlsson, 1990
1 Karlsson (1990)
     
authors: Michael McCord
   year: 1990
  names: [('Michael McCord', 'McCord')]
 et al.: False
   and : False
 -
1 McCord
1 McCord, 1990
0 McCord (1990)
     
authors: Igor A. Mel'~uk
   year: 1987
  names: [("Igor A. Mel'~uk", "Mel'~uk")]
 et al.: False
   and : False
 -
1 Mel'~uk
0 Mel'~uk, 1987
1 Mel'~uk (1987)
     
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
   year: 1996
  names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
 et al.: True
   and : False
 -
1 Samuelsson
0 Samuelsson, 1996
0 Samuelsson (1996)
1 Samuelsson et al.
0 Samuelsson et al., 1996
1 Samuelsson et al. (1996)
     
authors: Daniel Sleator and Davy Temperley
   year: 1991
  names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
 et al.: True
   and : True
 -
1 Sleator
0 Sleator, 1991
0 Sleator (1991)
0 Sleator et al.
0 Sleator et al., 1991
0 Sleator et al. (1991)
0 Sleator and Temperley
0 Sleator and Temperley, 1991
0 Sleator and Temperley (1991)
     
authors: Pasi Tapanainen and Time J/irvinen
   year: 1994
  names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
 et al.: True
   and : True
 -
6 Tapanainen
0 Tapanainen, 1994
0 Tapanainen (1994)
0 Tapanainen et al.
0 Tapanainen et al., 1994
0 Tapanainen et al. (1994)
0 Tapanainen and J/irvinen
0 Tapanainen and J/irvinen, 1994
0 Tapanainen and J/irvinen (1994)
     
authors: Pasi Tapanainen
   year: 1996
  names: [('Pasi Tapanainen', 'Tapanainen')]
 et al.: False
   and : False
 -
6 Tapanainen
3 Tapanainen, 1996
0 Tapanainen (1996)
     
authors: Lucien TesniSre
   year: 1959
  names: [('Lucien TesniSre', 'TesniSre')]
 et al.: False
   and : False
 -
0 TesniSre
0 TesniSre, 1959
0 TesniSre (1959)
     
authors: Atro Voutilainen
   year: 1995
  names: [('Atro Voutilainen', 'Voutilainen')]
 et al.: False
   and : False
 -
3 Voutilainen
0 Voutilainen, 1995
1 Voutilainen (1995)
     
