Website article and Adobe PDF file discovery and extraction

Detailed description of the stimson-web-scraper Python project


Stimson scraper

Scrape and crawl websites for text data and URLs in any ISO language

Getting started on Mac OS

In a terminal window:

    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    xcode-select --install
    brew update
    brew upgrade

    git --version
    git version 2.24.1 (Apple Git-126)

    brew install python3
    python3 --version
        Python 3.7.7

    pip3 install -U pytest
    py.test --version
        This is pytest version 5.4.1, imported from /usr/local/lib/python3.7/site-packages/pytest/__init__.py

Install desktop tools

Download GitHub Desktop

Optionally download PyCharm Professional

    open https://www.jetbrains.com/pycharm/download

Git on the server: generating your SSH public key

Reference

    open https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/

Check to make sure your GitHub key has been added to the ssh-agent list. Here's my ~/.ssh/config file:

    Host github.com github
        IdentityFile ~/.ssh/id_rsa
        IdentitiesOnly yes
        UseKeyChain yes
        AddKeysToAgent yes

Generate a key and add it to the agent:

    cd ~/.ssh
    ssh-keygen -o
    ssh-add -K ~/.ssh/id_rsa
    ssh-add -L

Get the project source code

    cd ~
    git clone https://github.com/Stimson-Center/stimson-web-scraper.git

Getting started with web scraping

Run the test suite to verify environment integrity

    cd ~/stimson-web-scraper
    ./run_tests.sh

Run as a Python3 executable

    cd ~/stimson-web-scraper/scraper
    ./start.sh
    ./cli.py -u https://www.yahoo.com -l en
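
The command above passes a URL with -u and a language code with -l. As a rough illustration of how flags like these could be parsed, here is a minimal argparse sketch. It is not the project's actual cli.py: the long option names (--url, --language), the default of "en", and the function name parse_args are all assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse -u and -l, mirroring the flags in the command above.

    Hypothetical sketch: the long names and the default language are
    assumptions, not taken from the real cli.py.
    """
    parser = argparse.ArgumentParser(description="Scrape one web page (sketch).")
    parser.add_argument("-u", "--url", required=True, help="page to scrape")
    parser.add_argument("-l", "--language", default="en",
                        help="two-letter language code (assumed default: en)")
    return parser.parse_args(argv)


# Passing an explicit list avoids reading the real command line.
args = parse_args(["-u", "https://www.yahoo.com", "-l", "en"])
print(args.url, args.language)
```

Accepting an optional argv list keeps the parser easy to exercise from tests or an interactive session.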

Run as a Python3 package

Get an article from a website page

    import datetime
    from scraper import Article

    url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
    article = Article(url)
    article.build()

    # Access data scraped from this web site page
    article.authors
    # ['Leigh Ann Caldwell', 'John Honway']
    article.publish_date
    # datetime.datetime(2013, 12, 30, 0, 0)
    article.text
    # "Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
    article.top_image
    # 'http://someCDN.com/blah/blah/blah/file.png'
    article.movies
    # ['http://youtube.com/path/to/link.com', ...]
    article.keywords
    # ['New Years', 'resolution', ...]
    article.summary
    # 'The study shows that 93% of people ...'
    article.html
    # '<!DOCTYPE HTML><html itemscope itemtype="http://...'
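
Once an article has been built, its fields can be copied into a plain dictionary for storage or export. The sketch below reads only the attribute names shown above. The helper name article_to_record and the stand-in object are hypothetical, used here so the example runs without the scraper package or network access.

```python
import datetime
import json
from types import SimpleNamespace

# Attribute names taken from the example above.
FIELDS = ("authors", "publish_date", "text", "top_image",
          "movies", "keywords", "summary")


def article_to_record(article):
    """Copy the scraped fields of an article-like object into a plain dict."""
    record = {}
    for name in FIELDS:
        value = getattr(article, name, None)  # missing fields become None
        # datetime objects are not JSON-serializable; store ISO 8601 text.
        if isinstance(value, datetime.date):
            value = value.isoformat()
        record[name] = value
    return record


# Hypothetical stand-in for a built Article (assumption for illustration).
fake_article = SimpleNamespace(
    authors=["Leigh Ann Caldwell", "John Honway"],
    publish_date=datetime.datetime(2013, 12, 30, 0, 0),
    text="Washington (CNN) -- ...",
)
print(json.dumps(article_to_record(fake_article), indent=2))
```

Converting the publish date to ISO 8601 text keeps the record serializable with the standard json module.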

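Foreign-language websites

scraper can seamlessly extract text and detect languages. If no language is specified, it will attempt to auto-detect one. If you are certain of an article's language, you can specify it by its two-letter ISO code.

View the list of supported ISO languages

    import scraper
    scraper.get_languages()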
Your available languages are:
input code         full name
af			  Afrikaans
ar			  Arabic
be			  Belarusian
bg			  Bulgarian
bn			  Bengali
br			  Portuguese, Brazil
ca			  Catalan
cs			  Czech
da			  Danish
de			  German
el			  Greek
en			  English
eo			  Esperanto
es			  Spanish
et			  Estonian
eu			  Basque
fa			  Persian
fi			  Finnish
fr			  French
ga			  Irish
gl			  Galician
gu			  Gujarati
ha			  Hausa
he			  Hebrew
hi			  Hindi
hr			  Croatian
hu			  Hungarian
hy			  Armenian
id			  Indonesian
it			  Italian
ja			  Japanese
ka			  Georgian
ko			  Korean
ku			  Kurdish
la			  Latin
lt			  Lithuanian
lv			  Latvian
mk			  Macedonian
mr			  Marathi
ms			  Malay
nb			  Norwegian (Bokmål)
nl			  Dutch
no			  Norwegian
np			  Nepali
pl			  Polish
pt			  Portuguese
ro			  Romanian
ru			  Russian
sk			  Slovak
sl			  Slovenian
so			  Somali
sr			  Serbian
st			  Sotho, Southern
sv			  Swedish
sw			  Swahili
ta			  Tamil
th			  Thai
tl			  Tagalog
tr			  Turkish
uk			  Ukrainian
ur			  Urdu
vi			  Vietnamese
yo			  Yoruba
zh			  Chinese
zu			  Zulu

    import scraper
    scraper.get_languages()
    {'ar': 'Arabic', 'af': 'Afrikaans', 'be': 'Belarusian', 'bg': 'Bulgarian',
     'bn': 'Bengali', 'br': 'Portuguese, Brazil', 'ca': 'Catalan', 'cs': 'Czech',
     'da': 'Danish', 'de': 'German', 'el': 'Greek', 'en': 'English',
     'eo': 'Esperanto', 'es': 'Spanish', 'et': 'Estonian', 'eu': 'Basque',
     'fa': 'Persian', 'fi': 'Finnish', 'fr': 'French', 'ga': 'Irish',
     'gl': 'Galician', 'gu': 'Gujarati', 'ha': 'Hausa', 'he': 'Hebrew',
     'hi': 'Hindi', 'hr': 'Croatian', 'hu': 'Hungarian', 'hy': 'Armenian',
     'id': 'Indonesian', 'it': 'Italian', 'ja': 'Japanese', 'ka': 'Georgian',
     'ko': 'Korean', 'ku': 'Kurdish', 'la': 'Latin', 'lt': 'Lithuanian',
     'lv': 'Latvian', 'mk': 'Macedonian', 'mr': 'Marathi', 'ms': 'Malay',
     'nb': 'Norwegian (Bokmål)', 'nl': 'Dutch', 'no': 'Norwegian', 'np': 'Nepali',
     'pl': 'Polish', 'pt': 'Portuguese', 'ro': 'Romanian', 'ru': 'Russian',
     'sk': 'Slovak', 'sl': 'Slovenian', 'so': 'Somali', 'sr': 'Serbian',
     'st': 'Sotho, Southern', 'sv': 'Swedish', 'sw': 'Swahili', 'ta': 'Tamil',
     'th': 'Thai', 'tl': 'Tagalog', 'tr': 'Turkish', 'uk': 'Ukrainian',
     'ur': 'Urdu', 'vi': 'Vietnamese', 'yo': 'Yoruba', 'zh': 'Chinese',
     'zu': 'Zulu'}
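
Before passing a language code to Article, it may help to normalize and validate it against the supported codes. The sketch below hard-codes a small subset of the codes listed above so that it runs on its own. The name normalize_language is hypothetical, and using the mapping from scraper.get_languages() in place of the hard-coded subset is an assumption about how the two could be combined.

```python
# A small subset of the codes listed above (hard-coded for illustration).
# Assumption: the full mapping shown for scraper.get_languages() could be
# passed as `supported` instead.
SUPPORTED = {"en": "English", "th": "Thai", "zh": "Chinese", "ar": "Arabic"}


def normalize_language(code, supported=SUPPORTED):
    """Return the trimmed, lower-cased code, or raise ValueError if unsupported."""
    cleaned = code.strip().lower()
    if cleaned not in supported:
        raise ValueError(f"unsupported language code: {code!r}")
    return cleaned


print(normalize_language(" ZH "))
```

Failing early with a ValueError makes a mistyped code visible before any network request is made.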

Import an article in a supported ISO language

    from scraper import Article

    url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
    article = Article(url, language='zh')  # Chinese
    article.build()

    print(article.text[:150])
    # 香港行政长官梁振英在各方压力下就其大宅的违章建僭建问题到立法会接受质询并向香港民众道歉梁振英在星期二12月10日的答问大会开始之际在其演说中道歉但强调他在违章建筑问题上没有隐瞒的意图和动机一些亲北京阵营议员欢迎梁振英道歉且认为应能获得香港民众接受但这些议员也质问梁振英有

    print(article.title)
    # 港特首梁振英就住宅违建事件道歉

Extract text from an Adobe PDF file in any ISO language

    from scraper import Article

    url = "http://tpch-th.listedcompany.com/misc/ShareholderMTG/egm201701/20170914-tpch-egm201701-enc02-th.pdf"
    article = Article(url=url, language='th')
    article.build()
    print(article.text)
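
When working through a list of links, it can be useful to guess which ones point at PDF files before scraping them. The sketch below checks only the URL path and uses only the standard library. It is not how scraper itself detects PDFs, and the name looks_like_pdf is hypothetical. A server can also serve a PDF from a URL without a .pdf suffix, so checking the response's Content-Type header would be more reliable.

```python
from urllib.parse import urlparse


def looks_like_pdf(url):
    """Guess from the URL path alone whether a link points at a PDF."""
    path = urlparse(url).path  # ignores any query string or fragment
    return path.lower().endswith(".pdf")


pdf_url = ("http://tpch-th.listedcompany.com/misc/ShareholderMTG/"
           "egm201701/20170914-tpch-egm201701-enc02-th.pdf")
print(looks_like_pdf(pdf_url))
```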

Get a Wikipedia article including embedded tables

    from scraper import Article

    url = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
    article = Article(url=url, language='en')
    article.build()
    print(article.text)
    print(article.tables)

Optionally set up a Docker environment

    brew install docker
    docker --version
    cd ~/stimson-web-scraper
    ./run_docker.sh

This drops you into a shell inside the container:

(venv)tf docker/app>

    ./run_tests.sh

For details, see:

Docker Tutorial

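Contributing

  • Fork it
  • Create your feature branch (git checkout -b your_github_name-feature)
  • Commit your changes (git commit -am 'Added some feature')
  • Make sure to add tests for it. This is important so that we don't unintentionally break it in a future version.
  • File an Issue
  • Push to the branch (git push origin your_github_name-feature)
  • Create a new Pull Request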
