Python 爬虫入门

步骤

  • 获取网页内容

  • 解析网页内容

  • 储存或分析数据

HTTP请求和响应

  • GET 方法——获得数据

  • POST方法——创建数据

常见状态码
  • 200 OK

  • 301 Moved Permanently

  • 400 Bad Request

  • 401 Unauthoried

  • 403 Forbidden

  • 404 Not Found

  • 500 internal Server Error

  • 503 Server Unavailable

Request发送请求


import requests

headers = {

    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

} # 定义请求头,伪装成浏览器请求

response = requests.get("https://jia.cx/",headers=headers)

print(response) # response 示例

print(response.status_code) # 响应的状态码

if response.ok:

    print("请求成功")

    print(response.text) # 返回内容

else:

    print("请求失败")

Beautiful Soup解析HTML

ihpx4o-1.webp


from bs4 import BeautifulSoup

import requests

headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = "https://jia.cx/"

content = requests.get(url, headers=headers)

soup = BeautifulSoup(content.text, 'html.parser')

all_titles = soup.find_all('h1')

print(all_titles)

项目实战

爬取豆瓣TOP 250电影标题

from bs4 import BeautifulSoup

import requests

headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

for start_num in range(0,250, 25):

    url = f"https://movie.douban.com/top250?start={start_num}&filter="

    response = requests.get(url, headers=headers)

    html = response.text

    context = BeautifulSoup(html, 'html.parser')

    all_titles = context.findAll("span", attrs={"class": "title"})

    for title in all_titles:

        title_string = title.string

        if "/" not in title_string:

            print(title_string)