How do I convert the Social Book Search XML collection into a TREC collection?

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book SYSTEM "books.dtd">
<book>
<isbn>0373078005</isbn>
<title>Never Trust A Lady (Silhouette Intimate Moments, No 800) (Harlequin Intimate Moments, No 800)</title>
<ean>9780373078004</ean>
<binding>Paperback</binding>
<label>Silhouette</label>
<browseNodes>
<browseNode id="388186011">Refinements</browseNode>
<browseNode id="394174011">Binding (binding)</browseNode>
<browseNode id="400272011">Paperback</browseNode>
</browseNodes>
</book>

<book>
<isbn>0373078005</isbn>
<text>0373078005 Never Trust A Lady (Silhouette Intimate Moments, No 800 (Harlequin Intimate Moments, No 800) 9780373078004 Paperback Silhouette $3.99 Silhouette Silhouette 1997-07-01 Silhouette Refinements Binding (binding) Paperback </text>
</book>
<book>
<isbn>0373084005</isbn>
<text>0373084005 Written On The Wind (Silhouette Romance, No 400) 9780373084005 Paperback Silhouette $1.95 Silhouette Silhouette 1985-11-01 Silhouette 70 420 650 10 Rita Rainville Author Artificial intellingence Romance contemporary sr category Romance Subjects Contemporary Series Silhouette Romance Books General Refinements Binding (binding) Paperback Format (feature_browse-bin) Printed Books General AAS</text>
</book>
...

import os, sys
from bs4 import BeautifulSoup
datadest="no collection path"
datdir = "C:\\xmlfiles\\python-trec\\"
for folds in os.listdir(datdir):
    os.mkdir(datadest+folds)
    trectxt=""
    for files in os.listdir(datdir+folds):
        if files.endswith(".xml"):
            content= open(datdir+"/"+folds+"/"+files,'r').read()
            soup = BeautifulSoup(content)
            texts = soup.findAll("book")
            for text in texts:
                isbn =texts[0].findAll("isbn")[0].getText()
                trectxt+="<book>\n<isbn>"+isbn+"</isbn>\n"
                trectxt+="<text>"+' '.join(texts[0].findAll(text=True))+"</text>\n</book>\n"
                f=open(datadest+folds+"/"+folds+".xml","w")
                f.write(trectxt)
                f.close()

C:\Python27>python C:\Python27\Scripts\trec-conversion.py
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 6, in <module>
    os.mkdir(datadest+folds)
WindowsError: [Error 183] Cannot create a file when that file already exists: 'no collection pathdata1'
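
The traceback above shows that `datadest+folds` evaluated to 'no collection pathdata1', a directory that already existed, so `os.mkdir` raised error 183. One way to avoid this, as a minimal sketch: build the path with `os.path.join` and create the directory only when it is missing. The helper name `ensure_dir` and the temporary output root are assumptions made for illustration; they are not part of the original script.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents) unless it is already a directory."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# Hypothetical output root: a temporary directory stands in for the real one
datadest = tempfile.mkdtemp()
out_dir = os.path.join(datadest, "data1")  # "data1" is the folder name seen in the traceback

ensure_dir(out_dir)
ensure_dir(out_dir)  # the second call does nothing instead of raising error 183
print(os.path.isdir(out_dir))
```

The check works on both Python 2 and Python 3; on Python 3 alone, `os.makedirs(path, exist_ok=True)` would do the same job.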

C:\Python27\Scripts\trec-conversion.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 11 of the file C:\Python27\Scripts\trec-conversion.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
  soup = BeautifulSoup(content)
Traceback (most recent call last):
  File "C:\Python27\Scripts\trec-conversion.py", line 18, in <module>
    f.write(trectxt)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1141: ordinal not in range(128)
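
The `UnicodeEncodeError` above occurs because, on Python 2, writing unicode text to a file opened with the built-in `open` encodes it implicitly as ASCII, which cannot represent u'\u2019'. A minimal sketch of one possible fix is to open the output file with an explicit encoding through `io.open`. The sample record and the temporary output path below are made up for illustration.

```python
import io
import os
import tempfile

# Made-up record containing u"\u2019", the character named in the traceback
trectxt = u"<book>\n<isbn>0373078005</isbn>\n<text>It\u2019s a sample</text>\n</book>\n"

# Hypothetical output location; a temporary directory keeps the sketch self-contained
out_path = os.path.join(tempfile.mkdtemp(), "data1.xml")

# io.open takes an encoding argument on both Python 2 and Python 3
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(trectxt)

# Read the file back with the same encoding to confirm nothing was lost
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == trectxt)
```

With this approach the text passed to `f.write` has to be unicode, which is what BeautifulSoup returns from `getText()`.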

1 Answer

User

#1 · Posted 2024-06-08 08:42:46

I made the following changes:

# encoding=utf8
import os, sys

# Python 2 only: make UTF-8 the implicit codec so f.write() accepts non-ASCII text
reload(sys)
sys.setdefaultencoding('utf8')

from bs4 import BeautifulSoup

datadest = "C:\\xmlfiles\\python-trec-results\\"
datdir = "C:\\xmlfiles\\python-trec\\"

for folds in os.listdir(datdir):
    # Create the output folder only if it does not exist yet
    if not os.path.isdir(datadest + folds):
        os.mkdir(datadest + folds)
    trectxt = ""
    for files in os.listdir(datdir + folds):
        if files.endswith(".xml"):
            content = open(os.path.join(datdir, folds, files), 'r').read()
            soup = BeautifulSoup(content, 'lxml', from_encoding='utf-8')
            for text in soup.findAll("book"):
                # Read the ISBN and all text nodes from the current <book> element
                isbn = text.findAll("isbn")[0].getText()
                trectxt += "<book>\n<isbn>" + isbn + "</isbn>\n"
                trectxt += "<text>" + ' '.join(text.findAll(text=True)) + "</text>\n</book>\n"
    # Write one TREC file per input folder, after all of its XML files have been read
    f = open(datadest + folds + "/" + folds + ".xml", "w")
    f.write(trectxt)
    f.close()

Now the program works! However, it leaves far too much extra whitespace inside the node values, as the `<text>` element below shows:

<book>
<isbn>0268020000</isbn>
<text>
0268020000 
Aquinas On Matter and Form and the Elements: A Translation and Interpretation of the DE PRINCIPIIS NATURAE and the DE MIXTIONE ELEMENTORUM of St. Thomas Aquinas 
9780268020002 
Paperback 
University of Notre Dame Press 
$25.00 
University of Notre Dame Press 
University of Notre Dame Press 


1998-03-28 
University of Notre Dame Press 

2000-11-16 
Wonderful Exposition 
Bobick has done it again.  After reading Bobick's insightful translation and exposition of Aquinas' "De Ente et Esentia", I was pleased to find that his knack for explaining Aquinas' complex ideas in metaphysics and natural philospohy is repeated in this book.  For those who wish to understand Aquinas in depth, this book is a must. 
5 
0 
0 

Physics 
Cosmology 
Professional & Technical 



</text>
</book>
<book>
<isbn>0268037000</isbn>
<text>
0268037000
...

I want to remove the unnecessary whitespace and line breaks so that the output looks like this:

<book>
<isbn>0268020000</isbn>
<text> ....text goes here....</text>
</book>
<book>
<isbn> 0268037000 </isbn>
<text>....text goes here.....</text>
</book>
...

I have tried the existing answers about removing whitespace, but none of them worked for me. Please help.
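
One possible approach, shown here only as a minimal sketch: the blank lines come from whitespace-only text nodes in the `findAll(text=True)` result, so the fragments can be joined and then split on any run of whitespace before rejoining with single spaces. The helper names `collapse_whitespace` and `trec_record` are hypothetical, and the sample fragments are shortened stand-ins for what the script above collects.

```python
def collapse_whitespace(fragments):
    """Join text fragments with single spaces, dropping newlines and blank runs."""
    # str.split() without arguments splits on any whitespace run and discards empties
    return " ".join(" ".join(fragments).split())


def trec_record(isbn, fragments):
    """Build one <book> record in the layout requested above (hypothetical helper)."""
    return "<book>\n<isbn>%s</isbn>\n<text>%s</text>\n</book>\n" % (
        isbn.strip(), collapse_whitespace(fragments))


# Fragments shaped like the text nodes of one <book>, including whitespace-only ones
sample = ["\n", "0268020000", "\n", "Aquinas On Matter and Form", "\n\n", "Paperback", " \n"]
print(trec_record(" 0268020000 ", sample))
```

In the script above, this would mean passing the `findAll(text=True)` result to `collapse_whitespace` instead of joining it directly. BeautifulSoup's own `get_text(" ", strip=True)` also trims each fragment, but it leaves double spaces inside a fragment untouched, which is why the sketch splits and rejoins the whole string.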
