The MIaS Search package implements the Math Information Retrieval system that won the NTCIR-11 Math-2 main task (Růžička et al., 2014).


NTCIR MIaS Search: our search engine for the NTCIR Math tasks

NTCIR MIaS Search is a Python 3 command-line utility that implements, on top of WebMIaS, the Math Information Retrieval system that won the NTCIR-11 Math-2 main task (see the task paper and the system description paper).

Experimentally, NTCIR MIaS Search also reranks the results using relevance probability estimates from the NTCIR Math Density Estimator package.

Usage

Installing

The package can be installed by executing the following command:

$ pip install ntcir-mias-search

Displaying the usage

Usage information for the package can be displayed by executing the following command:

$ ntcir-mias-search --help
usage: ntcir-mias-search [-h] --dataset DATASET --topics TOPICS --positions
                         POSITIONS --estimates ESTIMATES --webmias-url
                         WEBMIAS_URL
                         [--webmias-index-number WEBMIAS_INDEX_NUMBER]
                         [--num-workers-querying NUM_WORKERS_QUERYING]
                         [--num-workers-merging NUM_WORKERS_MERGING]
                         --output-directory OUTPUT_DIRECTORY

Use topics in the NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR format
to query the WebMIaS interface of the MIaS Math Information Retrieval system
and to retrieve result document lists.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET     A path to a directory containing a dataset in the
                        NTCIR-11 Math-2, and NTCIR-12 MathIR XHTML5 format.
                        The directory does not need to exist, since the path
                        is only required for extracting data from the file
                        with estimated positions of paragraph identifiers.
  --topics TOPICS       A path to a file containing topics in the NTCIR-10
                        Math, NTCIR-11 Math-2, and NTCIR-12 MathIR format.
  --positions POSITIONS 
                        The path to the file, where the estimated positions of
                        all paragraph identifiers from our dataset were stored
                        by the NTCIR Math Density Estimator package.
  --estimates ESTIMATES 
                        The path to the file, where the density, and
                        probability estimates for our dataset were stored by
                        the NTCIR Math Density Estimator package.
  --webmias-url WEBMIAS_URL
                        The URL at which a WebMIaS Java Servlet has been
                        deployed.
  --webmias-index-number WEBMIAS_INDEX_NUMBER
                        The numeric identifier of the WebMIaS index that
                        corresponds to the dataset. Defaults to 0.
  --num-workers-querying NUM_WORKERS_QUERYING
                        The number of processes that will send queries to
                        WebMIaS. Defaults to 1. Note that querying, reranking,
                        and merging takes place simmultaneously.
  --num-workers-merging NUM_WORKERS_MERGING
                        The number of processes that will rerank results.
                        Defaults to 3. Note that querying, reranking, and
                        merging takes place simmultaneously.
  --output-directory OUTPUT_DIRECTORY
                        The path to the directory, where the output files will
                        be stored.
  --plots PLOTS [PLOTS ...]
                        The path to the files, where the evaluation results
                        will plotted.
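
The flags documented above can also be assembled programmatically. The following is a minimal sketch, not part of the package: the helper name `build_command` and the example file names are assumptions made for illustration, and only flags that appear in the usage line above are used. The defaults mirror the ones stated in the help text.

```python
import shlex


def build_command(dataset, topics, positions, estimates, webmias_url,
                  output_directory, index_number=0,
                  workers_querying=1, workers_merging=3):
    """Assemble an ntcir-mias-search argument list (hypothetical helper).

    The default values follow the help text: index 0, one querying
    worker, and three merging workers.
    """
    return [
        "ntcir-mias-search",
        "--dataset", str(dataset),
        "--topics", str(topics),
        "--positions", str(positions),
        "--estimates", str(estimates),
        "--webmias-url", str(webmias_url),
        "--webmias-index-number", str(index_number),
        "--num-workers-querying", str(workers_querying),
        "--num-workers-merging", str(workers_merging),
        "--output-directory", str(output_directory),
    ]


# Example values only; substitute the paths of your own files.
command = build_command(
    dataset="ntcir-11-12",
    topics="NTCIR11-Math2-queries-participants.xml",
    positions="positions.pkl.gz",
    estimates="estimates.pkl.gz",
    webmias_url="http://localhost:58080/WebMIaS",
    output_directory="search_results",
    index_number=1,
    workers_querying=8,
    workers_merging=56,
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

To actually run the search, the list could be passed to `subprocess.run(command, check=True)`, assuming the package is installed and a WebMIaS instance is reachable at the given URL.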

Querying WebMIaS

The following command queries a local WebMIaS instance using 64 worker processes:

$ mkdir search_results

$ ntcir-mias-search --num-workers-querying 8 --num-workers-merging 56 \
>     --dataset ntcir-11-12 \
>     --topics NTCIR11-Math2-queries-participants.xml \
>     --judgements NTCIR11_Math-qrels.dat \
>     --estimates estimates.pkl.gz --positions positions.pkl.gz \
>     --webmias-url http://localhost:58080/WebMIaS --webmias-index-number 1 \
>     --plots plot.pdf plot.svg \
>     --output-directory search_results
Reading relevance judgements from NTCIR11_Math-qrels.dat
50 judged topics and 2500 total judgements in NTCIR11_Math-qrels.dat
Reading topics from NTCIR11-Math2-queries-participants.xml
50 topics (NTCIR11-Math-1, NTCIR11-Math-2, ...) contain 55 formulae, and 113 keywords
Establishing connection with a WebMIaS Java Servlet at http://localhost:58080/WebMIaS
Reading paragraph position estimates from positions.pkl.gz
8301578 total paragraph identifiers in positions.pkl.gz
Reading density, and probability estimates from estimates.pkl.gz
Querying WebMIaSIndex(http://localhost:58080/WebMIaS, 1), reranking and merging results
Using 3 strategies to aggregate MIaS scores with probability estimates:
- The best possible score that uses relevance judgements (look for 'best' in filenames)
- The original MIaS score with the probability estimate discarded (look for 'orig' in filenames)
- The worst possible score that uses relevance judgements (look for 'worst' in filenames)
Storing reranked per-query result lists in search_results
Using 4 formats to represent mathematical formulae in queries:
- Content MathML XML language (look for 'CMath' in filenames)
- Combined Presentation and Content MathML XML language (look for 'PCMath' in filenames)
- Presentation MathML XML language (look for 'PMath' in filenames)
- The TeX language by professor Knuth (look for 'TeX' in filenames)
Result list for topic NTCIR11-Math-9 contains only 188 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-17 contains only 716 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-26 contains only 518 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-39 contains only 419 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-43 contains only 924 / 1000 results, sampling the dataset
get_results:  100%|███████████████████████████████████████████████| 50/50 [00:26<00:00,  1.88it/s]
rerank_and_merge_results: 200it [01:02,  3.18it/s]
Storing final result lists in mias_search_results
100%|█████████████████████████████████████████████████████████████| 12/12 [00:13<00:00,  3.73it/s]
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- best, TeX: 0.5076
- best, CMath: 0.4983
- orig, PCMath: 0.4917
- ...
- orig, PMath: 0.4616
- worst, CMath: 0.3080
- worst, TeX: 0.2810
- worst, PMath: 0.1156
- worst, PCMath: 0.1141
Plotting plot.svg
Plotting plot.pdf

$ ls search_results
final_CMath.best.tsv
final_CMath.orig.tsv
final_CMath.worst.tsv
final_PCMath.best.tsv
final_PCMath.orig.tsv
final_PCMath.worst.tsv
final_PMath.best.tsv
final_PMath.orig.tsv
final_PMath.worst.tsv
final_TeX.best.tsv
final_TeX.orig.tsv
final_TeX.worst.tsv
NTCIR11-Math-10_CMath.1.query.txt
NTCIR11-Math-10_CMath.1.response.xml
NTCIR11-Math-10_CMath.1.results.best.tsv
NTCIR11-Math-10_CMath.1.results.orig.tsv
NTCIR11-Math-10_CMath.1.results.worst.tsv
NTCIR11-Math-10_CMath.2.query.txt
NTCIR11-Math-10_CMath.2.response.xml
...
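
The names in the output directory encode the formula format and the aggregation strategy that the log mentions ("look for 'best' in filenames"). The sketch below shows one way to decode them in Python. It is based only on the file names listed above; the function name `describe` is hypothetical, and the reading of the number after the format as a per-topic query number is an assumption.

```python
import re

# final_<format>.<strategy>.tsv
FINAL = re.compile(
    r"^final_(?P<format>[^.]+)\.(?P<strategy>best|orig|worst)\.tsv$"
)

# <topic>_<format>.<number>.<query.txt | response.xml | results.<strategy>.tsv>
PER_QUERY = re.compile(
    r"^(?P<topic>[^_]+)_(?P<format>[^.]+)\.(?P<number>\d+)\."
    r"(?P<kind>query\.txt|response\.xml"
    r"|results\.(?P<strategy>best|orig|worst)\.tsv)$"
)


def describe(filename):
    """Return the fields encoded in an output file name, or None."""
    match = FINAL.match(filename)
    if match:
        return {"scope": "final", **match.groupdict()}
    match = PER_QUERY.match(filename)
    if match:
        info = match.groupdict()
        info["scope"] = "per-query"
        # Keep only 'query', 'response', or 'results' as the kind.
        info["kind"] = info["kind"].split(".")[0]
        return info
    return None


for name in ("final_PCMath.orig.tsv",
             "NTCIR11-Math-10_CMath.1.results.best.tsv"):
    print(name, describe(name))
```

Files that are not tied to a strategy, such as the stored queries and the raw responses, get a `strategy` of `None`.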

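The following command queries the remote WebMIaS instance at https://mir.fi.muni.cz/webmias-demo using 64 worker processes: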

$ mkdir search_results

$ ntcir-mias-search --num-workers-querying 8 --num-workers-merging 56 \
>     --dataset ntcir-11-12 \
>     --topics NTCIR11-Math2-queries-participants.xml \
>     --judgements NTCIR11_Math-qrels.dat \
>     --estimates estimates.pkl.gz --positions positions.pkl.gz \
>     --webmias-url https://mir.fi.muni.cz/webmias-demo --webmias-index-number 0 \
>     --plots plot.pdf plot.svg \
>     --output-directory search_results
Reading relevance judgements from NTCIR11_Math-qrels.dat
50 judged topics and 2500 total judgements in NTCIR11_Math-qrels.dat
Reading topics from NTCIR11-Math2-queries-participants.xml
50 topics (NTCIR11-Math-1, NTCIR11-Math-2, ...) contain 55 formulae, and 113 keywords
Establishing connection with a WebMIaS Java Servlet at https://mir.fi.muni.cz/webmias-demo
Reading paragraph position estimates from positions.pkl.gz
8301578 total paragraph identifiers in positions.pkl.gz
Reading density, and probability estimates from estimates.pkl.gz
Querying WebMIaSIndex(https://mir.fi.muni.cz/webmias-demo, 0), reranking and merging results
Using 3 strategies to aggregate MIaS scores with probability estimates:
- The best possible score that uses relevance judgements (look for 'best' in filenames)
- The original MIaS score with the probability estimate discarded (look for 'orig' in filenames)
- The worst possible score that uses relevance judgements (look for 'worst' in filenames)
Storing reranked per-query result lists in search_results
Using 4 formats to represent mathematical formulae in queries:
- Content MathML XML language (look for 'CMath' in filenames)
- Combined Presentation and Content MathML XML language (look for 'PCMath' in filenames)
- Presentation MathML XML language (look for 'PMath' in filenames)
- The TeX language by professor Knuth (look for 'TeX' in filenames)
get_results:  100%|███████████████████████████████████████████████| 50/50 [05:29<00:00,  6.58s/it]
rerank_and_merge_results: 200it [06:57,  2.09s/it]
Storing final result lists in mias_search_results
100%|█████████████████████████████████████████████████████████████| 12/12 [00:13<00:00,  3.73it/s]
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- best, TeX: 0.5076
- best, CMath: 0.4983
- orig, PCMath: 0.4917
- ...
- orig, PMath: 0.4616
- worst, CMath: 0.3080
- worst, TeX: 0.2810
- worst, PMath: 0.1156
- worst, PCMath: 0.1141
Plotting plot.svg
Plotting plot.pdf

$ ls search_results
final_CMath.best.tsv
final_CMath.orig.tsv
final_CMath.worst.tsv
final_PCMath.best.tsv
final_PCMath.orig.tsv
final_PCMath.worst.tsv
final_PMath.best.tsv
final_PMath.orig.tsv
final_PMath.worst.tsv
final_TeX.best.tsv
final_TeX.orig.tsv
final_TeX.worst.tsv
NTCIR11-Math-10_CMath.1.query.txt
NTCIR11-Math-10_CMath.1.response.xml
NTCIR11-Math-10_CMath.1.results.best.tsv
NTCIR11-Math-10_CMath.1.results.orig.tsv
NTCIR11-Math-10_CMath.1.results.worst.tsv
NTCIR11-Math-10_CMath.2.query.txt
NTCIR11-Math-10_CMath.2.response.xml
...
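
The "Evaluation results" lines printed in the transcripts above have a regular shape (strategy, format, score), so they can be collected into a dictionary for comparison. This is a minimal sketch rather than a feature of the package: the function name `parse_evaluation` is hypothetical, and the sample lines are copied from the output shown above, including the elided `- ...` line, which the parser skips.

```python
import re

# Matches lines such as "- best, PCMath: 0.5569".
LINE = re.compile(
    r"^- (?P<strategy>\w+), (?P<format>\w+): (?P<score>\d+\.\d+)$"
)


def parse_evaluation(lines):
    """Map (strategy, format) pairs to the reported scores."""
    scores = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            key = (match["strategy"], match["format"])
            scores[key] = float(match["score"])
    return scores


sample = """\
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- best, TeX: 0.5076
- best, CMath: 0.4983
- orig, PCMath: 0.4917
- ...
- orig, PMath: 0.4616
- worst, CMath: 0.3080
- worst, TeX: 0.2810
- worst, PMath: 0.1156
- worst, PCMath: 0.1141
"""

scores = parse_evaluation(sample.splitlines())
top = max(scores, key=scores.get)
print(top, scores[top])
```

Keying the dictionary by the (strategy, format) pair makes it easy to compare, for example, the `orig` score of a format against its `best` and `worst` bounds.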

Contributing

To get familiar with the codebase, please consult the Umbrello project document project.xmi.

Rendered UML class diagram

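Citing NTCIR MIaS Search

Text

RŮŽIČKA, Michal, Petr SOJKA and Martin LÍŠKA. Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy. In Noriko Kando, Hideo Joho, Kazuaki Kishida. Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo: National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430 Japan, 2014. p. 127-134, 8 pp. ISBN 978-4-86049-065-2.

BibTeX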

@inproceedings{mir:MIaSNTCIR-11,
  author = "Michal R\r{u}\v{z}i\v{c}ka and Petr Sojka and Martin L{\'\i}\v{s}ka",
  title = "{Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy}",
  month = Dec,
  year = 2014,
  address = "Tokyo",
  booktitle = "{Proc. of the 11th NTCIR Conference on Evaluation of Information Access Technologies}",
  editor = "Hideo Joho and Kazuaki Kishida",
  publisher = "{NII, Tokyo, Japan}",
  pages = "127--134",
}
