Learning web scraping: parsing the Douban Movie Top 250 with BeautifulSoup

My first scraper project

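After learning basic Python syntax and the basics of scraping static web pages, I felt I needed something to practice on, so I started with Douban, which is relatively easy, and exported the result to CSV. Friends, you can now work your way through the Douban Movie Top 250.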

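I mainly used the requests library to fetch the pages and BeautifulSoup to parse them. I was stuck on the regular expression for a long time: I searched for ready-made patterns online, but none of them worked no matter what I tried, and it turned out I had misunderstood the ^ symbol. (Important: as far as I could tell, none of the regular expressions for matching 1 to 10 digit numbers that I could find on Google were correct.)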
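The post does not say exactly how ^ was misused, so the following is only a minimal sketch of the two meanings that are easy to mix up, plus one way to pick out a 1 to 10 digit count. The sample string and the "人评价" ("people rated") suffix are assumptions made for illustration, not taken from the script.

```python
import re

# Made-up sample shaped like the text of one rating block: a score, then a count.
# The "人评价" suffix is an assumption about the page text.
sample = "9.7 2345678人评价"

# Outside brackets, ^ anchors the pattern to the start of the string.
at_start = re.search(r"^\d{1,10}", sample)
# Inside brackets, a leading ^ negates the class: one or more NON-digits.
non_digits = re.search(r"[^0-9]+", sample)
# 1 to 10 digits directly followed by the suffix; the group keeps only the digits.
votes = re.search(r"(\d{1,10})人评价", sample)

print(at_start.group())    # 9
print(non_digits.group())  # .
print(votes.group(1))      # 2345678

# fullmatch checks that the WHOLE string consists of 1 to 10 digits.
print(bool(re.fullmatch(r"\d{1,10}", "1234567890")))   # True
print(bool(re.fullmatch(r"\d{1,10}", "12345678901")))  # False (11 digits)
```

Anchoring on the text that follows the number avoids accidentally matching the digits of the score itself.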

Code:

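# -*- coding: UTF-8 -*-
import csv
import re

import requests
from bs4 import BeautifulSoup

# Assumption: Douban may reject requests that lack a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}
# Two or more digits in a row, so the single digits of a score like 9.7 are skipped.
RATING_COUNT = re.compile(r"\d{2,}")


def gethtml(url):
    """Fetch a page and return its text, or None if the request fails."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print("error:", e)
        return None


def getinfo(html):
    """Return a list of (title, rating, number of raters) tuples for one page."""
    movies = []
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all(class_="item"):
        try:
            # Title: the first element with class "title".
            title = item.find(class_="title").get_text(strip=True)
            # Average rating.
            rating = item.find(class_="rating_num").get_text(strip=True)
            # The "star" block contains the score and the number of raters.
            star = item.find(class_="star").get_text()
            rating_count = RATING_COUNT.search(star).group()
            movies.append((title, rating, rating_count))
        except AttributeError:
            # A tag was missing or the pattern did not match; skip this entry.
            continue
    return movies


def write_to_csv(rows):
    """Append rows to topmovies.csv."""
    # newline="" lets the csv module control line endings (no blank rows on Windows).
    with open("topmovies.csv", "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\r\n")
        writer.writerows(rows)


def main():
    start_url = "https://movie.douban.com/top250?start={0}&filter="
    depth = 10  # 10 pages of 25 movies each
    for i in range(depth):
        html = gethtml(start_url.format(i * 25))
        if html is None:
            continue  # skip pages that failed to download
        write_to_csv(getinfo(html))


if __name__ == "__main__":
    main()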
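One thing to keep in mind: the script opens topmovies.csv in append mode, so running it twice leaves duplicate rows in the file. A possible alternative, sketched here with made-up placeholder rows and a hypothetical helper named write_rows, is to collect everything first and write it once together with a header line.

```python
import csv
import io

# Placeholder rows in the same (title, rating, rating count) shape as the scraper's
# output; the values are invented for illustration only.
rows = [
    ("Movie A", "9.0", "12345"),
    ("Movie B", "8.5", "678"),
]


def write_rows(f, rows):
    """Write a header line followed by all rows to an open text file."""
    writer = csv.writer(f)
    writer.writerow(["title", "rating", "rating_count"])
    writer.writerows(rows)


# In-memory demonstration. For a real file, something like
# open("topmovies.csv", "w", newline="", encoding="utf-8-sig") could be used instead.
buf = io.StringIO(newline="")
write_rows(buf, rows)
print(buf.getvalue())
```

Opening with mode "w" overwrites the previous run, and the utf-8-sig encoding adds a byte order mark that usually helps spreadsheet programs display Chinese titles correctly.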

An excerpt of the final CSV:
[Two screenshots of the CSV output]
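It looks like I have only seen 6 or 7 of the top 10 movies.
One of my favorites, 初恋这件小事 (A Little Thing Called Love), made the list too, haha.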