The NTCIR MIaS Search package implements the Math Information Retrieval system that won the NTCIR-11 Math-2 main task (Růžička et al., 2014).
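NTCIR MIaS Search: our search engine for the NTCIR Math task
NTCIR MIaS Search is a Python 3 command-line utility that operates on top of WebMIaS and implements the Math Information Retrieval system that won the NTCIR-11 Math-2 main task (see the task paper and the system description paper).
Experimentally, NTCIR MIaS Search also reranks the results according to the relevance probability estimates produced by the NTCIR Math Density Estimator package.
Usage
Installing
The package can be installed by executing the following command: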
$ pip install ntcir-mias-search
Displaying the usage
Usage information for the package can be displayed by executing the following command:
$ ntcir-mias-search --help
usage: ntcir-mias-search [-h] --dataset DATASET --topics TOPICS --positions
POSITIONS --estimates ESTIMATES --webmias-url
WEBMIAS_URL
[--webmias-index-number WEBMIAS_INDEX_NUMBER]
[--num-workers-querying NUM_WORKERS_QUERYING]
[--num-workers-merging NUM_WORKERS_MERGING]
--output-directory OUTPUT_DIRECTORY
Use topics in the NTCIR-10 Math, NTCIR-11 Math-2, and NTCIR-12 MathIR format
to query the WebMIaS interface of the MIaS Math Information Retrieval system
and to retrieve result document lists.
optional arguments:
-h, --help show this help message and exit
--dataset DATASET A path to a directory containing a dataset in the
NTCIR-11 Math-2, and NTCIR-12 MathIR XHTML5 format.
The directory does not need to exist, since the path
is only required for extracting data from the file
with estimated positions of paragraph identifiers.
--topics TOPICS A path to a file containing topics in the NTCIR-10
Math, NTCIR-11 Math-2, and NTCIR-12 MathIR format.
--positions POSITIONS
The path to the file, where the estimated positions of
all paragraph identifiers from our dataset were stored
by the NTCIR Math Density Estimator package.
--estimates ESTIMATES
The path to the file, where the density, and
probability estimates for our dataset were stored by
the NTCIR Math Density Estimator package.
--webmias-url WEBMIAS_URL
The URL at which a WebMIaS Java Servlet has been
deployed.
--webmias-index-number WEBMIAS_INDEX_NUMBER
The numeric identifier of the WebMIaS index that
corresponds to the dataset. Defaults to 0.
--num-workers-querying NUM_WORKERS_QUERYING
The number of processes that will send queries to
WebMIaS. Defaults to 1. Note that querying, reranking,
and merging takes place simmultaneously.
--num-workers-merging NUM_WORKERS_MERGING
The number of processes that will rerank results.
Defaults to 3. Note that querying, reranking, and
merging takes place simmultaneously.
--output-directory OUTPUT_DIRECTORY
The path to the directory, where the output files will
be stored.
--plots PLOTS [PLOTS ...]
The path to the files, where the evaluation results
will plotted.
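The options listed above can also be assembled programmatically. The following is a minimal sketch, not part of the package: the helper names `build_command` and `run_search` are hypothetical, only flags that appear in the help text are used, and the defaults mirror the ones the help text states (index number 0, one querying worker, three merging workers).

```python
import shlex
import subprocess


def build_command(dataset, topics, positions, estimates, webmias_url,
                  output_directory, index_number=0,
                  workers_querying=1, workers_merging=3):
    """Assemble an ntcir-mias-search argument list (hypothetical helper)."""
    return [
        "ntcir-mias-search",
        "--dataset", str(dataset),
        "--topics", str(topics),
        "--positions", str(positions),
        "--estimates", str(estimates),
        "--webmias-url", str(webmias_url),
        "--webmias-index-number", str(index_number),
        "--num-workers-querying", str(workers_querying),
        "--num-workers-merging", str(workers_merging),
        "--output-directory", str(output_directory),
    ]


def run_search(**options):
    """Run the utility; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**options), check=True)


command = build_command(
    dataset="ntcir-11-12",
    topics="NTCIR11-Math2-queries-participants.xml",
    positions="positions.pkl.gz",
    estimates="estimates.pkl.gz",
    webmias_url="http://localhost:58080/WebMIaS",
    output_directory="search_results",
)
# Print a shell-quoted form of the command for inspection before running it.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces; `run_search` is only invoked when the search should actually start.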
Querying WebMIaS
The following command queries a local WebMIaS instance using 64 worker processes:
$ mkdir search_results
$ ntcir-mias-search --num-workers-querying 8 --num-workers-merging 56 \
> --dataset ntcir-11-12 \
> --topics NTCIR11-Math2-queries-participants.xml \
> --judgements NTCIR11_Math-qrels.dat \
> --estimates estimates.pkl.gz --positions positions.pkl.gz \
> --webmias-url http://localhost:58080/WebMIaS --webmias-index-number 1 \
> --plots plot.pdf plot.svg \
> --output-directory search_results
Reading relevance judgements from NTCIR11_Math-qrels.dat
50 judged topics and 2500 total judgements in NTCIR11_Math-qrels.dat
Reading topics from NTCIR11-Math2-queries-participants.xml
50 topics (NTCIR11-Math-1, NTCIR11-Math-2, ...) contain 55 formulae, and 113 keywords
Establishing connection with a WebMIaS Java Servlet at http://localhost:58080/WebMIaS
Reading paragraph position estimates from positions.pkl.gz
8301578 total paragraph identifiers in positions.pkl.gz
Reading density, and probability estimates from estimates.pkl.gz
Querying WebMIaSIndex(http://localhost:58080/WebMIaS, 1), reranking and merging results
Using 3 strategies to aggregate MIaS scores with probability estimates:
- The best possible score that uses relevance judgements (look for 'best' in filenames)
- The original MIaS score with the probability estimate discarded (look for 'orig' in filenames)
- The worst possible score that uses relevance judgements (look for 'worst' in filenames)
Storing reranked per-query result lists in search_results
Using 4 formats to represent mathematical formulae in queries:
- Content MathML XML language (look for 'CMath' in filenames)
- Combined Presentation and Content MathML XML language (look for 'PCMath' in filenames)
- Presentation MathML XML language (look for 'PMath' in filenames)
- The TeX language by professor Knuth (look for 'TeX' in filenames)
Result list for topic NTCIR11-Math-9 contains only 188 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-17 contains only 716 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-26 contains only 518 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-39 contains only 419 / 1000 results, sampling the dataset
Result list for topic NTCIR11-Math-43 contains only 924 / 1000 results, sampling the dataset
get_results: 100%|███████████████████████████████████████████████| 50/50 [00:26<00:00, 1.88it/s]
rerank_and_merge_results: 200it [01:02, 3.18it/s]
Storing final result lists in mias_search_results
100%|█████████████████████████████████████████████████████████████| 12/12 [00:13<00:00, 3.73it/s]
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- best, TeX: 0.5076
- best, CMath: 0.4983
- orig, PCMath: 0.4917
- ...
- orig, PMath: 0.4616
- worst, CMath: 0.3080
- worst, TeX: 0.2810
- worst, PMath: 0.1156
- worst, PCMath: 0.1141
Plotting plot.svg
Plotting plot.pdf
$ ls search_results
final_CMath.best.tsv
final_CMath.orig.tsv
final_CMath.worst.tsv
final_PCMath.best.tsv
final_PCMath.orig.tsv
final_PCMath.worst.tsv
final_PMath.best.tsv
final_PMath.orig.tsv
final_PMath.worst.tsv
final_TeX.best.tsv
final_TeX.orig.tsv
final_TeX.worst.tsv
NTCIR11-Math-10_CMath.1.query.txt
NTCIR11-Math-10_CMath.1.response.xml
NTCIR11-Math-10_CMath.1.results.best.tsv
NTCIR11-Math-10_CMath.1.results.orig.tsv
NTCIR11-Math-10_CMath.1.results.worst.tsv
NTCIR11-Math-10_CMath.2.query.txt
NTCIR11-Math-10_CMath.2.response.xml
...
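The directory listing above suggests that final result lists are named `final_<format>.<strategy>.tsv`. Assuming that naming pattern holds (it is inferred from the listing, not from the package documentation), the files can be grouped by query format and aggregation strategy with a short sketch like this one. The function name `group_final_result_lists` is hypothetical, and the contents of the TSV files are deliberately not parsed because their layout is not described here.

```python
import re
from collections import defaultdict

# Pattern inferred from the listing: final_<format>.<strategy>.tsv
FINAL_RESULT_LIST = re.compile(
    r"^final_(?P<format>[^.]+)\.(?P<strategy>best|orig|worst)\.tsv$"
)


def group_final_result_lists(filenames):
    """Map each query format to a {strategy: filename} dictionary."""
    grouped = defaultdict(dict)
    for filename in filenames:
        match = FINAL_RESULT_LIST.match(filename)
        if match is None:
            continue  # per-query files (*.query.txt, *.response.xml, ...) are skipped
        grouped[match["format"]][match["strategy"]] = filename
    return dict(grouped)


# A few names copied from the listing; os.listdir("search_results") would
# supply the real ones.
listing = [
    "final_CMath.best.tsv",
    "final_CMath.orig.tsv",
    "final_PCMath.worst.tsv",
    "final_TeX.orig.tsv",
    "NTCIR11-Math-10_CMath.1.query.txt",
    "NTCIR11-Math-10_CMath.1.results.best.tsv",
]
grouped = group_final_result_lists(listing)
print(grouped["CMath"])
```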
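The following command queries the remote WebMIaS instance at https://mir.fi.muni.cz/webmias-demo using 64 worker processes: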
$ mkdir search_results
$ ntcir-mias-search --num-workers-querying 8 --num-workers-merging 56 \
> --dataset ntcir-11-12 \
> --topics NTCIR11-Math2-queries-participants.xml \
> --judgements NTCIR11_Math-qrels.dat \
> --estimates estimates.pkl.gz --positions positions.pkl.gz \
> --webmias-url https://mir.fi.muni.cz/webmias-demo --webmias-index-number 0 \
> --plots plot.pdf plot.svg \
> --output-directory search_results
Reading relevance judgements from NTCIR11_Math-qrels.dat
50 judged topics and 2500 total judgements in NTCIR11_Math-qrels.dat
Reading topics from NTCIR11-Math2-queries-participants.xml
50 topics (NTCIR11-Math-1, NTCIR11-Math-2, ...) contain 55 formulae, and 113 keywords
Establishing connection with a WebMIaS Java Servlet at https://mir.fi.muni.cz/webmias-demo
Reading paragraph position estimates from positions.pkl.gz
8301578 total paragraph identifiers in positions.pkl.gz
Reading density, and probability estimates from estimates.pkl.gz
Querying WebMIaSIndex(https://mir.fi.muni.cz/webmias-demo, 0), reranking and merging results
Using 3 strategies to aggregate MIaS scores with probability estimates:
- The best possible score that uses relevance judgements (look for 'best' in filenames)
- The original MIaS score with the probability estimate discarded (look for 'orig' in filenames)
- The worst possible score that uses relevance judgements (look for 'worst' in filenames)
Storing reranked per-query result lists in search_results
Using 4 formats to represent mathematical formulae in queries:
- Content MathML XML language (look for 'CMath' in filenames)
- Combined Presentation and Content MathML XML language (look for 'PCMath' in filenames)
- Presentation MathML XML language (look for 'PMath' in filenames)
- The TeX language by professor Knuth (look for 'TeX' in filenames)
get_results: 100%|███████████████████████████████████████████████| 50/50 [05:29<00:00, 6.58s/it]
rerank_and_merge_results: 200it [06:57, 2.09s/it]
Storing final result lists in mias_search_results
100%|█████████████████████████████████████████████████████████████| 12/12 [00:13<00:00, 3.73it/s]
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- best, TeX: 0.5076
- best, CMath: 0.4983
- orig, PCMath: 0.4917
- ...
- orig, PMath: 0.4616
- worst, CMath: 0.3080
- worst, TeX: 0.2810
- worst, PMath: 0.1156
- worst, PCMath: 0.1141
Plotting plot.svg
Plotting plot.pdf
$ ls search_results
final_CMath.best.tsv
final_CMath.orig.tsv
final_CMath.worst.tsv
final_PCMath.best.tsv
final_PCMath.orig.tsv
final_PCMath.worst.tsv
final_PMath.best.tsv
final_PMath.orig.tsv
final_PMath.worst.tsv
final_TeX.best.tsv
final_TeX.orig.tsv
final_TeX.worst.tsv
NTCIR11-Math-10_CMath.1.query.txt
NTCIR11-Math-10_CMath.1.response.xml
NTCIR11-Math-10_CMath.1.results.best.tsv
NTCIR11-Math-10_CMath.1.results.orig.tsv
NTCIR11-Math-10_CMath.1.results.worst.tsv
NTCIR11-Math-10_CMath.2.query.txt
NTCIR11-Math-10_CMath.2.response.xml
...
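The "Evaluation results" lines in the transcripts above follow the shape `- <strategy>, <format>: <score>`. If the program output has been captured as text, a minimal sketch such as the following can turn those lines into a dictionary for comparison. The function name `parse_evaluation_results` is hypothetical, the line shape is inferred from the transcripts only, and elided lines such as `- ...` are simply ignored.

```python
import re

# Matches lines such as "- best, PCMath: 0.5569"; other log lines do not match.
EVALUATION_LINE = re.compile(
    r"^\s*-\s*(?P<strategy>\w+),\s*(?P<format>\w+):\s*(?P<score>\d+\.\d+)\s*$"
)


def parse_evaluation_results(log_text):
    """Return {(strategy, format): score} for every evaluation line found."""
    scores = {}
    for line in log_text.splitlines():
        match = EVALUATION_LINE.match(line)
        if match:
            key = (match["strategy"], match["format"])
            scores[key] = float(match["score"])
    return scores


# Sample lines copied from the transcript above.
log_text = """\
Evaluation results:
- best, PCMath: 0.5569
- best, PMath: 0.5283
- orig, PCMath: 0.4917
- ...
- worst, PCMath: 0.1141
Plotting plot.svg
"""
scores = parse_evaluation_results(log_text)
# Pick the highest-scoring (strategy, format) pair among the parsed lines.
top = max(scores, key=scores.get)
print(top, scores[top])
```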
Contributing
To get familiar with the codebase, please consult the Umbrello project document project.xmi.
Citing NTCIR MIaS Search
Text
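RŮŽIČKA, Michal, Petr SOJKA and Martin LÍŠKA. Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy. In Noriko Kando, Hideo Joho, Kazuaki Kishida. Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo: National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, 2014. pp. 127-134, 8 pp. ISBN 978-4-86049-065-2.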
BibTeX
@inproceedings{mir:MIaSNTCIR-11,
  author    = "Michal R\r{u}\v{z}i\v{c}ka and Petr Sojka and Martin L{\'i}\v{s}ka",
  title     = "{Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy}",
  month     = Dec,
  year      = 2014,
  address   = "Tokyo",
  booktitle = "{Proc. of the 11th NTCIR Conference on Evaluation of Information Access Technologies}",
  editor    = "Hideo Joho and Kazuaki Kishida",
  publisher = "{NII, Tokyo, Japan}",
  pages     = "127--134",
}