How do I convert the Social Book Search XML collection into a TREC collection?

Posted 2024-04-26 17:38:45


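I am experimenting with the Terrier IR platform on the Social Book Search dataset, which contains 2.8 million XML documents, each with more than 67 metadata fields. A sample XML file is shown below: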

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- version 1.0 / 2009-11-06T15:56:12+01:00 -->
<!DOCTYPE book SYSTEM "books.dtd">
<book>
<isbn>0373078005</isbn>
<title>Never Trust A Lady (Silhouette Intimate Moments, No 800) (Harlequin Intimate Moments, No 800)</title>
<ean>9780373078004</ean>
<binding>Paperback</binding>
<label>Silhouette</label>
<browseNodes>
<browseNode id="388186011">Refinements</browseNode>
<browseNode id="394174011">Binding (binding)</browseNode>
<browseNode id="400272011">Paperback</browseNode>
</browseNodes>
</book>

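Before indexing, however, I want to convert the collection to TREC collection format. All XML files in a given folder should be converted into a single TREC file, for example: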

<book>
<isbn>0373078005</isbn>
<text>0373078005 Never Trust A Lady (Silhouette Intimate Moments, No 800) (Harlequin Intimate Moments, No 800) 9780373078004 Paperback Silhouette $3.99 Silhouette Silhouette 1997-07-01 Silhouette Refinements Binding (binding) Paperback </text>
</book>
<book>
<isbn>0373084005</isbn>
<text>0373084005 Written On The Wind (Silhouette Romance, No 400) 9780373084005 Paperback Silhouette $1.95 Silhouette Silhouette 1985-11-01 Silhouette 70 420 650 10 Rita Rainville Author Artificial intellingence Romance contemporary sr category Romance Subjects Contemporary Series Silhouette Romance Books General Refinements Binding (binding) Paperback Format (feature_browse-bin) Printed Books General AAS</text>
</book>
...
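
A minimal sketch of the mapping described above, assuming Python 3 and the standard-library xml.etree.ElementTree module rather than the BeautifulSoup approach used in the script below. The function name book_to_trec is hypothetical. It takes one book document and returns one TREC-style record whose <text> element holds every text node, joined by single spaces.

```python
import xml.etree.ElementTree as ET


def book_to_trec(xml_source):
    """Turn one <book> document (str or bytes) into a TREC-style record."""
    root = ET.fromstring(xml_source)
    isbn = (root.findtext("isbn") or "").strip()
    # itertext() yields every text node in document order; split() followed
    # by join() collapses runs of spaces and newlines into single spaces.
    text = " ".join(" ".join(root.itertext()).split())
    # No XML escaping is applied here (an assumption that the indexer
    # tolerates raw characters such as "&" in the record).
    return "<book>\n<isbn>{}</isbn>\n<text>{}</text>\n</book>\n".format(isbn, text)
```

Concatenating the records returned for every file in a folder would give the single per-folder TREC file described above.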

I created C:\xmlfiles\python-trec and, inside it, two folders named data1 and data2, each holding some of the XML files. I started from a Python script at http://lab.hypothesis.org/1129, which I modified as follows:

import os, sys
from bs4 import BeautifulSoup
datadest="no collection path"
datdir = "C:\\xmlfiles\\python-trec\\"
for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content)
            texts = soup.findAll("book")
            for text in texts:
                isbn =texts[0].findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(texts[0].findAll(text=True))+"</text>\n</book>\n"
                f=open(datadest+folds+"/"+folds+".xml","w")
                f.write(trectxt)
                f.close()

I get the following error message:

C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
   os.mkdir(datadest+folds)
 WindowsError: [Error 183] Cannot create a file when that file already exists: 'no collection pathdata1'

After changing the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\", I get the following error message:

C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
   os.mkdir(datadest+folds)
WindowsError: [Error 183] Cannot create a file when that file already exists: 'C:\\xmlfiles\\python-trec\\data1'
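
Error 183 is Windows reporting that the directory passed to os.mkdir already exists. In the second run the target is the input folder data1 itself; in the first run it is a relative path that an earlier run had presumably already created. A minimal sketch of a more tolerant approach, assuming the output should live under a separate root: build the path with os.path.join and create it only when it is missing. The helper name ensure_output_dir is hypothetical.

```python
import os


def ensure_output_dir(dest_root, folder_name):
    """Return dest_root/folder_name, creating it only if it is missing."""
    # os.path.join inserts the separator, so dest_root does not need a
    # trailing slash or backslash.
    out_dir = os.path.join(dest_root, folder_name)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    return out_dir
```

Calling it twice with the same arguments returns the same path and does not raise on the second call.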

I then created a new folder, C:\\xmlfiles\\python-trec\\python-trec-results, changed the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\python-trec-results", and got the following error message:

C:\Python27\Scripts\trec-conversion.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file 
C:\Python27\Scripts\trec-conversion.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

soup = BeautifulSoup(content)
Traceback (most recent call last):
File "C:\Python27\Scripts\trec-conversion.py", line 18, in <module>
    f.write(trectxt)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1141: ordinal not in range(128)
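
The UnicodeEncodeError comes from f.write(trectxt): under Python 2, a file opened with plain open(..., "w") encodes Unicode text as ASCII, and u'\u2019' (a right single quotation mark) has no ASCII form. One possible fix, sketched under the assumption that UTF-8 output is acceptable to the indexer, is to open the output file with an explicit encoding through io.open, which exists in both Python 2.7 and Python 3. The helper name write_trec is hypothetical, and text is expected to be Unicode text. The UserWarning printed before the traceback is a separate matter; as its own message says, it goes away when a parser name is passed to the BeautifulSoup constructor.

```python
import io


def write_trec(path, text):
    """Write text to path as UTF-8, regardless of the platform default."""
    # An explicit encoding avoids the implicit ASCII conversion that
    # raised the error above.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```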

The code produces the desired TREC file for the data1 folder, but fails on the data2 folder with the message shown above.

Please help.

- Loki


1 answer

I made the following changes:

# encoding=utf8
import os, sys
reload(sys)
sys.setdefaultencoding('utf8')

from bs4 import BeautifulSoup

datadest="C:\\xmlfiles\\python-trec-results\\"
datdir = "C:\\xmlfiles\\python-trec\\"

for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content, 'lxml', from_encoding='utf-8')
            texts = soup.findAll("book")
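            for text in texts:
                # Use the loop variable rather than texts[0] so every <book> is converted
                isbn = text.findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(text.findAll(text=True))+"</text>\n</book>\n"
    # Write the folder's TREC file once, after all of its XML files have been read
    f=open(datadest+folds+"/"+folds+".xml","w")
    f.write(trectxt)
    f.close()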

Now the program works! However, it leaves too much extra whitespace inside the values of the <isbn> and <text> nodes, as shown below:

<book>
<isbn>0268020000</isbn>
<text>
0268020000 
Aquinas On Matter and Form and the Elements: A Translation and Interpretation of the DE PRINCIPIIS NATURAE and the DE MIXTIONE ELEMENTORUM of St. Thomas Aquinas 
9780268020002 
Paperback 
University of Notre Dame Press 
$25.00 
University of Notre Dame Press 
University of Notre Dame Press 


1998-03-28 
University of Notre Dame Press 

2000-11-16 
Wonderful Exposition 
Bobick has done it again.  After reading Bobick's insightful translation and exposition of Aquinas' "De Ente et Esentia", I was pleased to find that his knack for explaining Aquinas' complex ideas in metaphysics and natural philospohy is repeated in this book.  For those who wish to understand Aquinas in depth, this book is a must. 
5 
0 
0 

Physics 
Cosmology 
Professional & Technical 



</text>
</book>
<book>
<isbn>0268037000</isbn>
<text>
0268037000
... 

I want to remove the unnecessary whitespace and line breaks so that the output looks like this:

<book>
<isbn>0268020000</isbn>
<text> ....text goes here....</text>
</book>
<book>
<isbn> 0268037000 </isbn>
<text>....text goes here.....</text>
</book>
...

I tried the existing answers on removing whitespace, but they did not work for me. Please help.
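
The blank lines most likely come from the whitespace-only text nodes that sit between elements, which findAll(text=True) returns along with the real values. A minimal sketch of one way to remove them: join the fragments, split on any whitespace, and rejoin with single spaces. The helper name collapse_text is hypothetical; it accepts the list of strings that findAll(text=True) returns.

```python
def collapse_text(strings):
    """Join text fragments and squeeze every whitespace run to one space."""
    # str.split() with no argument splits on any run of spaces, tabs or
    # newlines and drops empty pieces, so whitespace-only fragments vanish.
    return " ".join(" ".join(strings).split())
```

In the script above, the ' '.join(...) expression that builds the <text> element would become collapse_text(...) around the same findAll(text=True) call. BeautifulSoup tags also offer a stripped_strings generator that skips whitespace-only strings, though it does not collapse whitespace inside a single string.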
