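Python: Get URL path sections
I want to know how to extract a specific section of the path from a URL. For example, I'd like a function that takes this URL:
http://www.mydomain.com/hithere?image=2934
and returns "hithere".
Or takes this URL:
http://www.mydomain.com/hithere/something/else
and also returns "hithere".
I know this will probably involve urllib or urllib2, but I can't work out from the docs how to get only one section of the path.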
7 Answers
26
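The best option for working with the path component of a URL is the posixpath module. It has the same interface as os.path and behaves the same way on POSIX and Windows NT based platforms.
Sample code: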
#!/usr/bin/env python3
import urllib.parse
import sys
import posixpath
import ntpath

def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    # Peel segments off the end. Stop once split() can no longer shorten
    # the path (root or empty string), so the loop cannot run forever.
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            break
        result.insert( 0, item )
        tmp = head
    return result

def dump_array( array ):
    # Render the list as [ "a", "b" ] without escaping the items.
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    # Decode percent-escapes before splitting the path into segments.
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n --[n={},m={}]-->\n {}\n".format(
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )
Code output:
http://eg.com/hithere/something/else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=False,m=posixpath]-->
[ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=posixpath]-->
[ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=ntpath]-->
[ "see", "if", "this", "works" ]
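Notes:
- On Windows NT based platforms, os.path is ntpath.
- On Unix/POSIX based platforms, os.path is posixpath.
- ntpath does not handle backslashes (\) correctly (see the last two cases in the output above), which is why posixpath is recommended.
- Remember to use urllib.parse.unquote.
- Consider using posixpath.normpath.
- The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath collapses multiple adjacent path separators (that is, it treats ///, // and / as the same). One exception: posixpath.normpath keeps exactly two leading slashes.
- Even though POSIX paths and URL paths have similar syntax and semantics, they are not identical.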
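The notes on unquote, normpath, and repeated separators can be checked directly. This is a minimal sketch using only the standard library; the URL and the short paths are made up for illustration and do not come from the answer itself.

```python
import posixpath
from urllib.parse import unquote, urlparse

# Hypothetical URL: an escaped space, a doubled slash, and a "." segment.
url = "http://eg.com/hi%20there//something/./else/"

raw_path = urlparse(url).path             # still percent-encoded
decoded = unquote(raw_path)               # "%20" becomes a space
normalized = posixpath.normpath(decoded)  # drops "//", "." and the trailing "/"

print(raw_path)
print(decoded)
print(normalized)

# Interior runs of "/" collapse, but exactly two leading slashes are kept.
collapsed = posixpath.normpath("/a//b///c")
two_leading = posixpath.normpath("//a/b")
three_leading = posixpath.normpath("///a/b")
print(collapsed, two_leading, three_leading)
```

Decoding before normalizing matters here: an escaped segment such as %2E%2E only becomes ".." after unquote, so normalizing first would leave it in place.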
51
Python 3.4+ solution:
from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath
url = 'http://www.example.com/hithere/something/else'
PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]
# returns 'hithere' (the same for the URL with parameters)
# parts holds ('/', 'hithere', 'something', 'else')
#               0   1          2            3
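Indexing parts[1] raises IndexError when the URL has no path segments at all (for example a bare host name). One way to guard against that is sketched below; the helper name first_path_segment is hypothetical and not part of the answer above.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlparse

def first_path_segment(url: str) -> Optional[str]:
    """Return the first segment of the URL path, or None if there is none."""
    path = PurePosixPath(unquote(urlparse(url).path))
    # For an absolute path, parts[0] is the root (the anchor); skip it.
    segments = path.parts[1:] if path.anchor else path.parts
    return segments[0] if segments else None

print(first_path_segment("http://www.mydomain.com/hithere?image=2934"))
print(first_path_segment("http://www.mydomain.com/hithere/something/else"))
print(first_path_segment("http://www.mydomain.com"))
```

Note that PurePosixPath drops "." segments and repeated slashes but leaves ".." segments in place, so a path like /a/../b still reports "a" as its first segment.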
67
Use urlparse (Python 2.7) to extract the path component of the URL:
import urlparse
path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
print path
> '/hithere/something/else'
Or use urllib.parse (Python 3):
import urllib.parse
path = urllib.parse.urlparse('http://www.example.com/hithere/something/else').path
print(path)
> '/hithere/something/else'
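The first URL in the question carries a query string, so it is worth confirming that urlparse keeps it out of the path. A short Python 3 sketch using the question's own URL:

```python
from urllib.parse import urlparse

parsed = urlparse("http://www.mydomain.com/hithere?image=2934")

# The query string is reported separately and never appears in .path.
print(parsed.path)   # /hithere
print(parsed.query)  # image=2934
```

Because of this, any path-splitting step that follows works the same way for URLs with and without parameters.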
Use os.path.split to split the path into components:
>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')
The dirname and basename functions give you the two parts of the split separately; you can call dirname in a loop:
>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
...
>>> path
'/hithere'
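The loop above never terminates if path is empty (a URL with no path), because dirname of an empty string is again an empty string. A guarded variant might look like the sketch below. It uses posixpath rather than os.path so the behavior does not depend on the platform, and the function name top_level_dir is an assumption made for illustration.

```python
import posixpath
from urllib.parse import urlparse

def top_level_dir(url):
    """Return "/<first segment>" of the URL path, or "/" if there are no segments."""
    # Force exactly one leading slash so the loop always ends at "/".
    path = posixpath.normpath("/" + urlparse(url).path.lstrip("/"))
    while posixpath.dirname(path) != "/":
        path = posixpath.dirname(path)
    return path

print(top_level_dir("http://www.example.com/hithere/something/else"))
print(top_level_dir("http://www.mydomain.com/hithere?image=2934"))
print(top_level_dir("http://www.example.com"))
```

Calling .lstrip("/") on the result gives the bare segment name, "hithere" for both URLs in the question.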