Web site article / Adobe PDF file discovery and extraction
stimson-web-scraper
Scrape and crawl text data and URLs from web sites in any ISO language
Table of Contents
- Getting Started on Mac OS
- Install Desktop tools
- Git on the Server Generating Your SSH Public Key
- Get project source code
- Getting started with Web Scraping
- Optionally Setting up a Docker environment
- Contributing
Getting Started on Mac OS
In a terminal window:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
xcode-select --install
brew update
brew upgrade
git --version
git version 2.24.1 (Apple Git-126)
brew install python3
python3 --version
Python 3.7.7
pip3 install -U pytest
py.test --version
This is pytest version 5.4.1, imported from /usr/local/lib/python3.7/site-packages/pytest/__init__.py
Install Desktop tools
Download GitHub Desktop
Optionally download PyCharm Professional
open https://www.jetbrains.com/pycharm/download
Git on the Server Generating Your SSH Public Key
open https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
Check to make sure your GitHub key has been added to the ssh-agent list. Here's my ~/.ssh/config file:
Host github.com github
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    UseKeyChain yes
    AddKeysToAgent yes
cd ~/.ssh
ssh-keygen -o
ssh-add -K ~/.ssh/id_rsa
ssh-add -L
Get project source code
cd ~
git clone https://github.com/Stimson-Center/stimson-web-scraper.git
Getting started with Web Scraping
Execute the test suite to ensure environment integrity
cd ~/stimson-web-scraper
./run_tests.sh
Execute as a Python3 executable
cd ~/stimson-web-scraper/scraper
./start.sh
./cli.py -u https://www.yahoo.com -l en
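To drive the command-line entry point from another Python script, the invocation above can be assembled as an argument list first. The following is a minimal sketch that assumes only the -u and -l flags shown above; build_cli_command is a hypothetical helper for illustration and is not part of the project.

```python
import shlex
from typing import List


def build_cli_command(url: str, language: str, executable: str = "./cli.py") -> List[str]:
    """Return the argument list for the cli.py invocation shown above.

    Hypothetical helper: it only mirrors the -u (URL) and -l (language)
    flags that appear in this README.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an http(s) URL, got {url!r}")
    if not language:
        raise ValueError("a language code such as 'en' is required")
    return [executable, "-u", url, "-l", language]


if __name__ == "__main__":
    command = build_cli_command("https://www.yahoo.com", "en")
    # Print a shell-safe rendering of the command without running it.
    print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be handed to subprocess.run from the scraper directory, which avoids shell quoting problems with URLs that contain characters such as & or ?.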
Execute as a Python3 package
Get an article from a web site page
import datetime
from scraper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.build()

# Access data scraped from this web site page
article.authors       # ['Leigh Ann Caldwell', 'John Honway']
article.publish_date  # datetime.datetime(2013, 12, 30, 0, 0)
article.text          # 'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
article.top_image     # 'http://someCDN.com/blah/blah/blah/file.png'
article.movies        # ['http://youtube.com/path/to/link.com', ...]
article.keywords      # ['New Years', 'resolution', ...]
article.summary       # 'The study shows that 93% of people ...'
article.html          # '<!DOCTYPE HTML><html itemscope itemtype="http://...'
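Once an article has been built, the attributes listed above are often easier to store or log as a single record. The sketch below is one possible way to do that, not something the project provides: article_to_record is a hypothetical helper, and it assumes only the attribute names shown above. A SimpleNamespace stands in for a built Article so the example runs without network access.

```python
import datetime
import json
from types import SimpleNamespace

# Attribute names taken from the example above.
FIELDS = ("authors", "publish_date", "text", "top_image", "keywords", "summary")


def article_to_record(article, text_limit=200):
    """Copy selected attributes of an article-like object into a JSON-ready dict."""
    record = {}
    for name in FIELDS:
        value = getattr(article, name, None)  # missing attributes become None
        if isinstance(value, (datetime.datetime, datetime.date)):
            value = value.isoformat()  # datetimes are not JSON serializable
        elif name == "text" and isinstance(value, str):
            value = value[:text_limit]  # keep only a preview of the body text
        record[name] = value
    return record


if __name__ == "__main__":
    # Stand-in object; a real run would pass the Article built above.
    stand_in = SimpleNamespace(
        authors=["Leigh Ann Caldwell", "John Honway"],
        publish_date=datetime.datetime(2013, 12, 30, 0, 0),
        text="Washington (CNN) -- Not everyone subscribes to a New Year's resolution...",
    )
    print(json.dumps(article_to_record(stand_in), indent=2))
```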
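Foreign-language web sites
scraper can extract text and detect languages seamlessly. If no language is specified, it will attempt to auto-detect one. If you are certain of an article's language, you can specify it with its two-letter ISO code.
View the list of supported ISO languages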
import scraper
scraper.get_languages()
Your available languages are:
input code full name
af Afrikaans
ar Arabic
be Belarusian
bg Bulgarian
bn Bengali
br Portuguese, Brazil
ca Catalan
cs Czech
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fr French
ga Irish
gl Galician
gu Gujarati
ha Hausa
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
hy Armenian
id Indonesian
it Italian
ja Japanese
ka Georgian
ko Korean
ku Kurdish
la Latin
lt Lithuanian
lv Latvian
mk Macedonian
mr Marathi
ms Malay
nb Norwegian (Bokmål)
nl Dutch
no Norwegian
np Nepali
pl Polish
pt Portuguese
ro Romanian
ru Russian
sk Slovak
sl Slovenian
so Somali
sr Serbian
st Sotho, Southern
sv Swedish
sw Swahili
ta Tamil
th Thai
tl Tagalog
tr Turkish
uk Ukrainian
ur Urdu
vi Vietnamese
yo Yoruba
zh Chinese
zu Zulu
import scraper
scraper.get_languages()
{'ar': 'Arabic', 'af': 'Afrikaans', 'be': 'Belarusian', 'bg': 'Bulgarian', 'bn': 'Bengali',
 'br': 'Portuguese, Brazil', 'ca': 'Catalan', 'cs': 'Czech', 'da': 'Danish', 'de': 'German',
 'el': 'Greek', 'en': 'English', 'eo': 'Esperanto', 'es': 'Spanish', 'et': 'Estonian',
 'eu': 'Basque', 'fa': 'Persian', 'fi': 'Finnish', 'fr': 'French', 'ga': 'Irish',
 'gl': 'Galician', 'gu': 'Gujarati', 'ha': 'Hausa', 'he': 'Hebrew', 'hi': 'Hindi',
 'hr': 'Croatian', 'hu': 'Hungarian', 'hy': 'Armenian', 'id': 'Indonesian', 'it': 'Italian',
 'ja': 'Japanese', 'ka': 'Georgian', 'ko': 'Korean', 'ku': 'Kurdish', 'la': 'Latin',
 'lt': 'Lithuanian', 'lv': 'Latvian', 'mk': 'Macedonian', 'mr': 'Marathi', 'ms': 'Malay',
 'nb': 'Norwegian (Bokmål)', 'nl': 'Dutch', 'no': 'Norwegian', 'np': 'Nepali', 'pl': 'Polish',
 'pt': 'Portuguese', 'ro': 'Romanian', 'ru': 'Russian', 'sk': 'Slovak', 'sl': 'Slovenian',
 'so': 'Somali', 'sr': 'Serbian', 'st': 'Sotho, Southern', 'sv': 'Swedish', 'sw': 'Swahili',
 'ta': 'Tamil', 'th': 'Thai', 'tl': 'Tagalog', 'tr': 'Turkish', 'uk': 'Ukrainian',
 'ur': 'Urdu', 'vi': 'Vietnamese', 'yo': 'Yoruba', 'zh': 'Chinese', 'zu': 'Zulu'}
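Because a language is passed as a short input code, it can help to check a user-supplied code against the supported mapping before building an article. The following is a minimal sketch under stated assumptions: normalize_language is a hypothetical helper, and SUPPORTED_SUBSET is a hand-copied subset of the mapping above used only so the example is self-contained. Note that the input codes are the project's own list; for example, br is listed here for Brazilian Portuguese.

```python
from typing import Mapping, Optional

# Hand-copied subset of the mapping shown above, for illustration only.
SUPPORTED_SUBSET = {
    "br": "Portuguese, Brazil",
    "en": "English",
    "th": "Thai",
    "zh": "Chinese",
}


def normalize_language(code: Optional[str], supported: Mapping[str, str]) -> Optional[str]:
    """Return a cleaned input code, or None when no language was given.

    None means the caller should omit the language argument and rely on
    auto-detection. Unknown codes raise ValueError.
    """
    if code is None:
        return None
    cleaned = code.strip().lower()
    if cleaned not in supported:
        known = ", ".join(sorted(supported))
        raise ValueError(f"unsupported language code {code!r}; known codes: {known}")
    return cleaned


if __name__ == "__main__":
    print(normalize_language(" ZH ", SUPPORTED_SUBSET))
```

In real use, the dictionary returned by scraper.get_languages() could be passed as the supported argument instead of the subset.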
Import an article in a supported ISO language
from scraper import Article

url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
article = Article(url, language='zh')  # Chinese
article.build()

print(article.text[:150])
# 香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有
print(article.title)
# 港特首梁振英就住宅违建事件道歉
Extract text from Adobe PDF files in any ISO language
from scraper import Article

url = "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf"
article = Article(url=url, language='th')
article.build()
print(article.text)
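When a crawl mixes HTML pages and PDF documents, it can be useful to label which URLs are likely PDFs before processing them, for example for logging. The sketch below is a simple heuristic based on the URL path suffix. It is an assumption made for illustration, not a description of how scraper itself recognises PDF content, and looks_like_pdf is a hypothetical helper.

```python
from urllib.parse import urlparse


def looks_like_pdf(url: str) -> bool:
    """Heuristic: True when the URL path ends in .pdf (case-insensitive).

    Query strings and fragments are ignored. A server can still return a
    PDF from a URL without this suffix, so treat the result as a hint.
    """
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


if __name__ == "__main__":
    urls = [
        "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf",
        "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects",
    ]
    for candidate in urls:
        print(looks_like_pdf(candidate), candidate)
```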
Get a Wikipedia article including embedded tables
from scraper import Article

url = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
article = Article(url=url, language='en')
article.build()
print(article.text)
print(article.tables)
Optionally Setting up a Docker environment
brew install docker
docker --version
cd ~/stimson-web-scraper
./run_docker.sh
You will be dropped into a shell inside the Docker container:
(venv)tf docker/app>
./run_tests.sh
For more details, see:
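Contributing
- Fork it
- Create your feature branch (git checkout -b your_github_name-feature)
- Commit your changes (git commit -am 'Added some feature')
- Make sure to add tests for it. This is important so that we don't unintentionally break it in a future version.
- File an Issue
- Push to the branch (git push origin your_github_name-feature)
- Create a new Pull Request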