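I am running experiments with the Terrier-IR platform on the Social Book Search dataset, which contains 2.8 million XML documents, each with more than 67 metadata fields. A sample XML file is shown below: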
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- version 1.0 / 2009-11-06T15:56:12+01:00 -->
<!DOCTYPE book SYSTEM "books.dtd">
<book>
<isbn>0373078005</isbn>
<title>Never Trust A Lady (Silhouette Intimate Moments, No 800) (Harlequin Intimate Moments, No 800)</title>
<ean>9780373078004</ean>
<binding>Paperback</binding>
<label>Silhouette</label>
<browseNodes>
<browseNode id="388186011">Refinements</browseNode>
<browseNode id="394174011">Binding (binding)</browseNode>
<browseNode id="400272011">Paperback</browseNode>
</browseNodes>
</book>
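Before indexing, however, I want to convert the collection to the TREC collection format. All XML files in a given folder should be converted into a single TREC file, for example: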
<book>
<isbn>0373078005</isbn>
<text>0373078005 Never Trust A Lady (Silhouette Intimate Moments, No 800 (Harlequin Intimate Moments, No 800) 9780373078004 Paperback Silhouette $3.99 Silhouette Silhouette 1997-07-01 Silhouette Refinements Binding (binding) Paperback </text>
</book>
<book>
<isbn>0373084005</isbn>
<text>0373084005 Written On The Wind (Silhouette Romance, No 400) 9780373084005 Paperback Silhouette $1.95 Silhouette Silhouette 1985-11-01 Silhouette 70 420 650 10 Rita Rainville Author Artificial intellingence Romance contemporary sr category Romance Subjects Contemporary Series Silhouette Romance Books General Refinements Binding (binding) Paperback Format (feature_browse-bin) Printed Books General AAS</text>
</book>
...
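The target format above amounts to flattening each book: keep the ISBN in its own element and join every text node into one <text> element. A minimal sketch of that step follows, assuming the standard-library xml.etree.ElementTree parser rather than the BeautifulSoup approach used later in the question. The helper name book_to_trec and the shortened sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

def book_to_trec(xml_text):
    """Flatten one <book> document into a TREC-style record (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    isbn = (root.findtext("isbn") or "").strip()
    # itertext() yields every text node in document order;
    # split() followed by join() collapses runs of whitespace and newlines.
    body = " ".join(" ".join(root.itertext()).split())
    return "<book>\n<isbn>%s</isbn>\n<text>%s</text>\n</book>\n" % (isbn, body)

# Shortened sample record (assumption: only three fields, no DOCTYPE).
sample = """<book>
<isbn>0373078005</isbn>
<title>Never Trust A Lady</title>
<binding>Paperback</binding>
</book>"""

record = book_to_trec(sample)
print(record)
```

Records produced this way for every file in a folder could then be concatenated into one output file. The sketch does not escape characters such as & or < in the flattened text.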
I created C:\xmlfiles\python-trec, created two folders inside it, data1 and data2, and placed some XML files in each. I used a Python script from http://lab.hypothesis.org/1129, which I modified as follows:
import os, sys
from bs4 import BeautifulSoup
datadest="no collection path"
datdir = "C:\\xmlfiles\\python-trec\\"
for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content)
            texts = soup.findAll("book")
            for text in texts:
                isbn =texts[0].findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(texts[0].findAll(text=True))+"</text>\n</book>\n"
    f=open(datadest+folds+"/"+folds+".xml","w")
    f.write(trectxt)
    f.close()
I get the following error message:
C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
os.mkdir(datadest+folds)
WindowsError: [Error 183] Cannot create a file when that file already exists: 'no collection pathdata1'
After changing the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\", I get the following error message:
C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
os.mkdir(datadest+folds)
WindowsError: [Error 183] Cannot create a file when that file already exists: 'C:\\xmlfiles\\python-trec\\data1'
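Both tracebacks come from os.mkdir being asked to create a directory that already exists. In the second case datadest equals datdir, so datadest+folds is the input folder data1 itself. In the first case the directory was presumably left behind by an earlier run. A minimal sketch of a guard follows, assuming a hypothetical helper named ensure_dir and a temporary directory standing in for the real collection path. Building paths with os.path.join also avoids gluing names together when datadest has no trailing separator.

```python
import os
import tempfile

def ensure_dir(path):
    """Create path if it is missing; do nothing when it already exists."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# A temporary directory stands in for the real collection folder (assumption).
base = tempfile.mkdtemp()
out_root = os.path.join(base, "python-trec-results")

for fold in ["data1", "data2"]:
    # os.path.join inserts the separator that plain string concatenation omits.
    target = ensure_dir(os.path.join(out_root, fold))
    # Calling it a second time is harmless, unlike a bare os.mkdir.
    ensure_dir(target)
    print(target)
```

Keeping the output root outside the input folder also prevents os.listdir(datdir) from picking up result folders as if they were input folders.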
I then created a new folder, C:\\xmlfiles\\python-trec\\python-trec-results, changed the line datadest="no collection path" to datadest="C:\\xmlfiles\\python-trec\\python-trec-results", and got the following message:
C:\Python27\Scripts\trec-conversion.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 11 of the file
C:\Python27\Scripts\trec-conversion.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = BeautifulSoup(content)
Traceback (most recent call last):
File "C:\Python27\Scripts\trec-conversion.py", line 18, in <module>
f.write(trectxt)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1141: ordinal not in range(128)
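The UnicodeEncodeError is raised because the file opened with open(..., "w") on Python 2.7 encodes unicode text as ASCII, and u'\u2019' (a right single quotation mark) has no ASCII representation. A minimal sketch of one way around it follows, assuming UTF-8 is an acceptable output encoding. The record content and the output path are placeholders, not data from the collection.

```python
import io
import os
import tempfile

# u"\u2019" is the character named in the traceback; the record is a placeholder.
trectxt = u"<book>\n<isbn>0000000000</isbn>\n<text>A Lady\u2019s Trust</text>\n</book>\n"

out_path = os.path.join(tempfile.mkdtemp(), "data2.xml")  # placeholder path

# io.open accepts an encoding on both Python 2.7 and Python 3,
# so the text is encoded as UTF-8 on write instead of ASCII.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(trectxt)

# Read the file back with the same encoding to confirm nothing was lost.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == trectxt)
```

The UserWarning printed before the traceback is a separate matter. As its own text says, passing 'html.parser' to the BeautifulSoup constructor silences it.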
The code generates the required TREC file for the data1 folder, but fails for the data2 folder with the message above.
Please help.
Loki
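I made the following changes:
The program now works! However, it leaves too many extra spaces inside the values of the nodes, as shown below:
I want to remove the unnecessary whitespace and line breaks so that the output looks like this:
I tried the available answers for removing whitespace, but they did not work for me. Please help.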
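Since the extra whitespace comes from the line breaks and indentation between the original XML elements, one option is to normalize the joined text just before it is wrapped in the output element. A minimal sketch follows, assuming a hypothetical helper named squeeze and a made-up raw string. It does not depend on how the text was extracted.

```python
import re

def squeeze(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up example of text joined from several XML nodes.
raw = "0373078005 \n\n  Never Trust A Lady \n\t Paperback  \n"
print(squeeze(raw))
```

The same effect can be had without a regular expression by writing " ".join(text.split()).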