Downloading Web Pages and Files in Python 3 and 2 Without the requests Module

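I created a Python module named whatismyip that lets Python programs easily find out their Internet Protocol (IP) address. It works by connecting to one of several public websites that return this information, such as https://icanhazip.com/.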

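Because it is a module meant to be included in other programs, I wanted it to have as few dependencies as possible. Normally I would use the requests module to download these web pages, but I wanted to stick to the Python standard library. Here is how I download the HTML of a web page using only the standard library on both Python 3 and 2: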

import sys

if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
# 'https://ifconfig.me' is a website that returns simple info about your request. Replace this with the page you want to download.
requestObj = Request('https://ifconfig.me/all', headers={'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)

# To figure out how to decode the downloaded binary data to text, we need to get the character set encoding:
if sys.version_info[0] == 3:  # Python 3
    charsets = responseObj.info().get_charsets()
    if len(charsets) == 0 or charsets[0] is None:
        # Character set encoding could not be determined.
        charset = 'utf-8'  # Use the utf-8 encoding by default.
    else:
        # Use the first character set encoding listed. (It's often the only one.)
        charset = charsets[0]
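elif sys.version_info[0] == 2:  # Python 2
    # getparam() reads the charset parameter of the Content-Type header.
    # (getencoding() returns the Content-Transfer-Encoding, such as '7bit', which is not a character set.)
    charset = responseObj.headers.getparam('charset')
    if charset is None:
        # Character set encoding could not be determined.
        charset = 'utf-8'  # Use the utf-8 encoding by default.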

content = responseObj.read().decode(charset)
print(content)  # The HTML of the web page.
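
The character set lookup above can be tried out without a network connection. The following is a minimal sketch, not code from the whatismyip module: the helper names charset_from_content_type and decode_body are hypothetical, and it assumes you already have the raw Content-Type header value as a string. It parses the header with email.message.Message from the standard library and falls back to utf-8, as the code above does.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    # get_content_charset() returns the charset in lowercase, or None if absent.
    charset = msg.get_content_charset()
    if charset is None:
        return default
    return charset


def decode_body(raw_bytes, content_type):
    """Decode downloaded bytes using the charset from the Content-Type header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


print(charset_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(charset_from_content_type('text/html'))  # utf-8
print(decode_body(b'hello', 'text/plain; charset=ascii'))  # hello
```

Keeping the header parsing in its own function makes the utf-8 fallback easy to test separately from the download itself.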

Here is code for downloading a binary file (such as a .png image or a .zip file) from a URL:

import sys

if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Replace https://inventwithpython.com/images/cover_automate2_thumb.jpg with the file you want to download.
url = 'https://inventwithpython.com/images/cover_automate2_thumb.jpg'

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
requestObj = Request(url, headers={'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)

content = responseObj.read()
# Use the last part of the url as the local filename. Replace this with any local filename you want to use:
filename = url.split('/')[-1]
with open(filename, 'wb') as fileObj:
    fileObj.write(content)
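
Two details of the download code above may matter for other URLs: url.split('/')[-1] keeps any query string in the filename, and read() loads the whole file into memory. The following is a rough sketch of one way to handle both, under stated assumptions: filename_from_url and save_stream are hypothetical helper names, 'download.bin' is an arbitrary fallback name, and responseObj is assumed to be the file-like object returned by urlopen().

```python
import posixpath
import shutil
import sys

if sys.version_info[0] == 3:  # Python 3
    from urllib.parse import urlsplit
else:  # Python 2
    from urlparse import urlsplit


def filename_from_url(url, default='download.bin'):
    """Return the last path component of url, ignoring any query string."""
    name = posixpath.basename(urlsplit(url).path)
    # URLs that end in a slash (or have no path) yield an empty name.
    return name if name else default


def save_stream(responseObj, filename, chunk_size=64 * 1024):
    """Copy a file-like response to disk in chunks instead of one big read()."""
    with open(filename, 'wb') as fileObj:
        shutil.copyfileobj(responseObj, fileObj, chunk_size)


print(filename_from_url('https://inventwithpython.com/images/cover_automate2_thumb.jpg'))
print(filename_from_url('https://example.com/files/archive.zip?version=2'))
print(filename_from_url('https://example.com/'))
```

With these helpers, the last four lines of the download code could become save_stream(responseObj, filename_from_url(url)).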

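The urllib module in Python 2 is the original downloading module in the Python standard library, added in Python 1.2. The urllib2 module in Python 2 has additional features and was added in Python 1.6. In Python 3, there is a single module named urllib. There are also third-party modules named urllib3 and requests (which uses urllib3), but these are not in the Python standard library and will not be added to it.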
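
Because the module names differ between the two versions, the import can also be written with try/except instead of checking sys.version_info. This is an alternative sketch, not the approach used in the code above; the shortened user-agent string is only a placeholder.

```python
try:
    # Python 3: Request and urlopen live in urllib.request.
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2: the same names are provided by urllib2.
    from urllib2 import Request, urlopen

# Building a Request does not contact the network; only urlopen() does.
requestObj = Request('https://icanhazip.com/', headers={'User-agent': 'Mozilla/5.0'})
print(requestObj.get_full_url())
```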
