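Grabs page and article titles from the lists of URLs contained in the files passed in as arguments
title-grabber-cristianrasch: detailed description of the Python project
Title Grabber
Usage instructions
- Just give it one or more files containing URLs (one per line)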
python -m title_grabber /abs/path/2/urls1.csv rel/path/2/urls2.csv
- Alternatively, change the output file:
python -m title_grabber -o output.csv /abs/path/2/urls1.csv rel/path/2/urls2.csv
- See all available configuration options:
python -m title_grabber -h
usage: title_grabber [-h] [-o OUT_FILE] [--connect-timeout TIMEOUT]
[--read-timeout TIMEOUT] [-r RETRIES] [-t THREADS] [-d]
[FILES [FILES ...]]
positional arguments:
FILES 1 or more CSV files containing URLs (1 per line)
optional arguments:
-h, --help show this help message and exit
-o OUT_FILE, --output OUT_FILE
Output file (defaults to out.csv)
--connect-timeout TIMEOUT
HTTP connect timeout. Defaults to the value of the
CONNECT_TIMEOUT env var or 10
--read-timeout TIMEOUT
HTTP read timeout. Defaults to the value of the
READ_TIMEOUT env var or 15
--max-redirects REDIRECTS
Max. # of HTTP redirects to follow. Defaults to the
value of the MAX_REDIRECTS env var or 5
-r RETRIES, --max-retries RETRIES
Max. # of times to retry failed HTTP reqs. Defaults to
the value of the MAX_RETRIES env var or 3
-t THREADS, --max-threads THREADS
Max. # of threads to use. Defaults to the value of the
MAX_THREADS env var or the # of logical processors in
the system (8)
-d, --debug Log to STDOUT instead of to a file in the CWD.
Defaults to the value of the DEBUG env var or False
-V, --version Print program version and exit
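Several of the options above fall back to an environment variable before using a built-in default. The following is a minimal sketch of how that pattern can be written with `argparse`. It is not the project's actual code: the helper `env_int`, the function `build_parser`, the `env` parameter, and the choice of integer timeouts are all assumptions made for illustration.

```python
import argparse
import os


def env_int(name, fallback, env=None):
    """Return int(env[name]) when the variable is set and non-empty, else fallback.

    Hypothetical helper; `env` defaults to os.environ and exists to ease testing.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    return int(raw) if raw else fallback


def build_parser(env=None):
    """Build a parser mirroring a subset of the options in the help text above."""
    parser = argparse.ArgumentParser(prog="title_grabber")
    parser.add_argument("files", metavar="FILES", nargs="*")
    parser.add_argument("-o", "--output", dest="out_file",
                        metavar="OUT_FILE", default="out.csv")
    # Assumption: timeouts are whole seconds; the real tool may accept floats.
    parser.add_argument("--connect-timeout", type=int, metavar="TIMEOUT",
                        default=env_int("CONNECT_TIMEOUT", 10, env))
    parser.add_argument("--read-timeout", type=int, metavar="TIMEOUT",
                        default=env_int("READ_TIMEOUT", 15, env))
    parser.add_argument("-r", "--max-retries", type=int, metavar="RETRIES",
                        default=env_int("MAX_RETRIES", 3, env))
    # os.cpu_count() reports logical processors and may return None.
    parser.add_argument("-t", "--max-threads", type=int, metavar="THREADS",
                        default=env_int("MAX_THREADS", os.cpu_count() or 1, env))
    return parser
```

With this arrangement the precedence is command-line flag first, then environment variable, then the hard-coded default, which matches the wording of the help text.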
Development setup instructions
- Clone the project
git clone git@github.com:cristianrasch/title_grabber.git
- Create a new virtual environment for it
cd title_grabber && python3 -m venv venv
- Install its dependencies
pip install -r requirements.txt
- Run the test suite to make sure everything is set up correctly
python -m unittest discover -v -s title_grabber/tests/
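For contributors unfamiliar with `unittest` discovery, the sketch below shows the general shape of a test module that the command above would collect. The `extract_title` helper and the test class are hypothetical and use only the standard library's `html.parser`; the project's real parsing code and test names may differ.

```python
import unittest
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element it encounters."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized page title, or None if there is none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


class ExtractTitleTest(unittest.TestCase):
    def test_returns_title_text(self):
        html = "<html><head><title> Hello,\n world </title></head></html>"
        self.assertEqual(extract_title(html), "Hello, world")

    def test_returns_none_without_title(self):
        self.assertIsNone(extract_title("<p>no title here</p>"))
```

By default, `unittest discover` only picks up files whose names match `test*.py`, so a module like this would need a name such as `test_titles.py` (a made-up file name) inside the directory given to `-s`.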