Downloading Web Pages and Files in Python 3 and 2 Without the requests Module
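I created a Python module named whatismyip that lets Python programs easily find out what their Internet Protocol (IP) address is. It works by connecting to one of several public websites that return this information, such as https://icanhazip.com/.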
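The "try one of several sites" idea can be sketched roughly as follows. This is not the actual whatismyip source: the names looks_like_ip and first_ip_from, the fetch parameter, and the validation step are all assumptions made for illustration. The fetcher is passed in as an argument so the fallback logic can be tried without a network connection.

```python
import socket


def looks_like_ip(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, text)
            return True
        except (socket.error, ValueError):
            pass  # Not valid for this address family; try the next one.
    return False


def first_ip_from(urls, fetch):
    """Try each URL in order and return the first response that is an IP address.

    fetch is any callable that takes a URL and returns the response body as
    text. It may raise an exception if the site cannot be reached.
    """
    for url in urls:
        try:
            candidate = fetch(url).strip()
        except Exception:
            continue  # This site failed, so fall through to the next one.
        if looks_like_ip(candidate):
            return candidate
    return None  # No site gave a usable answer.
```

In real use, fetch would be a small function built on the urlopen() download code from this post, and urls would be a list of services such as https://icanhazip.com/.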
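Because it's a module meant to be included in other programs, I wanted it to have as few dependencies as possible. Normally I'd use the requests module to download these web pages, but I wanted to stick to the Python standard library. Here's how I download the HTML of a web page using only the standard library on both Python 3 and 2: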
import sys
if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
# 'https://ifconfig.me' is a website that returns simple info about your request. Replace this with the page you want to download.
requestObj = Request('https://ifconfig.me/all', headers={'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)

# To figure out how to decode the downloaded binary data to text, we need to get the character set encoding:
if sys.version_info[0] == 3:  # Python 3
    charsets = responseObj.info().get_charsets()
    if len(charsets) == 0 or charsets[0] is None:
        # Character set encoding could not be determined.
        charset = 'utf-8'  # Use the utf-8 encoding by default.
    else:
        # Use the first character set encoding listed. (It's often the only one.)
        charset = charsets[0]
elif sys.version_info[0] == 2:  # Python 2
    # getparam() reads the charset parameter of the Content-Type header and returns None if there isn't one.
    # (getencoding() is not used here because it reports the Content-Transfer-Encoding header, which is not the character set.)
    charset = responseObj.headers.getparam('charset')
    if charset is None:
        charset = 'utf-8'  # Use the utf-8 encoding by default.

content = responseObj.read().decode(charset)
print(content)  # The HTML of the web page.
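The charset lookup is the fiddly part of the code above, so it can help to see it in isolation. The sketch below is one possible way to pull the charset out of a raw Content-Type header value using the standard library's email.message.Message class, which is available on both Python 3 and 2. The helper name charset_from_content_type and its utf-8 default are assumptions for illustration and are not part of whatismyip.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or default."""
    if not content_type:
        return default  # The header is missing entirely.
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset() or default


print(charset_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(charset_from_content_type('text/plain'))                     # utf-8
```

Because the helper only takes a string, it can be checked against sample header values without making any web request.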
Here's the code to download a binary file (such as a .png image or .zip file) from a URL:
import sys
if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Replace https://inventwithpython.com/images/cover_automate2_thumb.jpg with the file you want to download.
url = 'https://inventwithpython.com/images/cover_automate2_thumb.jpg'

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
requestObj = Request(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)
content = responseObj.read()  # The raw bytes of the file. No decoding is needed for binary data.

# Set filename to the local filename you want to use:
filename = url.split('/')[-1]  # Use the filename from the url.
with open(filename, 'wb') as fileObj:
    fileObj.write(content)
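Two details in the code above matter for larger or less tidy downloads: read() loads the whole file into memory, and url.split('/')[-1] keeps any ?query part in the filename. A rough sketch of one way to handle both is shown here. The function names filename_from_url and download_to_dir, the 'download.bin' fallback, the chunk size, and the shortened user-agent string are all placeholders chosen for illustration.

```python
import os
import posixpath
import shutil
import sys
from contextlib import closing

if sys.version_info[0] == 3:  # Python 3
    from urllib.parse import unquote, urlsplit
    from urllib.request import Request, urlopen
else:  # Python 2
    from urllib import unquote
    from urllib2 import Request, urlopen
    from urlparse import urlsplit


def filename_from_url(url, fallback='download.bin'):
    """Return the last path segment of url, ignoring any ?query or #fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    return name or fallback  # URLs ending in '/' have no filename.


def download_to_dir(url, dest_dir='.', chunk_size=64 * 1024):
    """Stream url to a file inside dest_dir and return the path written."""
    dest_path = os.path.join(dest_dir, filename_from_url(url))
    # Placeholder user-agent; substitute a full browser string if a server needs one.
    request_obj = Request(url, headers={'User-agent': 'Mozilla/5.0'})
    with closing(urlopen(request_obj)) as response_obj:
        with open(dest_path, 'wb') as file_obj:
            # Copy in chunks so the whole file never has to sit in memory.
            shutil.copyfileobj(response_obj, file_obj, chunk_size)
    return dest_path
```

Calling download_to_dir('https://inventwithpython.com/images/cover_automate2_thumb.jpg') would save cover_automate2_thumb.jpg in the current directory. contextlib.closing() is used because the response object is not a context manager on Python 2.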
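The urllib module in Python 2 is the original downloading module in the Python standard library, added in Python 1.2. The urllib2 module in Python 2 has additional features and was added in Python 1.6. In Python 3, there is a single package named urllib, and the Request and urlopen() names used above live in its urllib.request submodule. There are also third-party modules named urllib3 and requests (which uses urllib3), but these are not in the Python standard library, nor will they be added to it.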
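As a rough map of where the commonly used names ended up, the sketch below imports the same handful of functions on either version. It uses a try/except ImportError pattern instead of the sys.version_info check used earlier in this post. That is only an alternative style, and the particular names imported are an illustrative selection.

```python
try:
    # Python 3: one urllib package, split into submodules.
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError
    from urllib.parse import quote, urlsplit
except ImportError:
    # Python 2: the same names are spread across three top-level modules.
    from urllib2 import Request, urlopen, HTTPError, URLError
    from urllib import quote
    from urlparse import urlsplit

print(quote('hello world'))                       # hello%20world
print(urlsplit('https://icanhazip.com/').netloc)  # icanhazip.com
```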