Swish-e is a fast, flexible, and free
open source system for indexing
collections of Web pages or other
files. Swish-e is ideally suited for
collections of a million documents or
smaller. Using the GNOME™ libxml2
parser and a collection of filters,
Swish-e can index plain text, e-mail,
PDF, HTML, XML, Microsoft®
Word/PowerPoint/Excel and just about
any file that can be converted to XML
or HTML text. Swish-e is also often
used to supplement databases like the
MySQL® DBMS for very fast full-text
searching.
Take a look at PDFMiner. It can easily do what you want. Also, please search for similar questions, since this may be a duplicate of: Python module for converting PDF to text
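As a rough illustration of the PDFMiner suggestion, here is a minimal sketch that writes the text of a PDF to a `.txt` file that a full-text indexer could then pick up. It assumes the `pdfminer.six` package (the maintained fork of PDFMiner) is installed; the helper names `text_path_for` and `convert_pdf` and the directory layout are made up for this example.

```python
from pathlib import Path


def text_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Return the .txt path inside out_dir that corresponds to pdf_path."""
    return out_dir / (pdf_path.stem + ".txt")


def convert_pdf(pdf_path: Path, out_dir: Path) -> Path:
    """Extract the text of one PDF and save it as UTF-8 text in out_dir."""
    # Assumption: pdfminer.six is installed. It is imported here so that
    # the path helper above stays usable without the package.
    from pdfminer.high_level import extract_text

    out_dir.mkdir(parents=True, exist_ok=True)
    target = text_path_for(pdf_path, out_dir)
    target.write_text(extract_text(str(pdf_path)), encoding="utf-8")
    return target
```

To convert a whole folder, something like `for pdf in Path("docs").glob("*.pdf"): convert_pdf(pdf, Path("text"))` should do, where `docs` and `text` are placeholder directory names.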
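We use Swish-e to index our website, which includes thousands of PDFs, Word files, and even WordPerfect files. It works very well. It is free, open source, and integrates well with PHP.
http://swish-e.org/index.html
The description quoted at the top is from their homepage.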