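Douban Movie Visualization Analysis

This article uses Python to run a visual analysis of the movies listed on Douban.
The full code and its output are included at the end.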

Data size

To make the data representative, this analysis covers more than 15,000 records spanning the 114 years from 1902 to 2016.
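
The year span can be derived along the following lines. This is a minimal sketch, assuming release years are stored as integers and that 0 marks a missing year, as the code at the end of this article does. The function name `year_span` and the sample list are illustrative only.

```python
def year_span(years):
    """Return (earliest, latest, span) over the known years; 0 means 'unknown'."""
    known = [y for y in years if y != 0]
    earliest, latest = min(known), max(known)
    return earliest, latest, latest - earliest


# Hypothetical sample, not the real dataset.
sample_years = [1994, 0, 1902, 2016, 2010]
print(year_span(sample_years))
```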

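Overall trend in movie ratings

The ratings of the Douban movies were averaged for each year, and the most recent thirty years were plotted as a trend chart:
[Figure: average rating by year, most recent thirty years]
After excluding extreme values, the chart shows that ratings over the past thirty years have been fairly steady with occasional fluctuations. The yearly average stays between 7 and 8, which is a relatively high level.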
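
One way to compute the per-year averages behind such a chart is sketched below. It assumes two parallel lists, `years` and `scores`, and again treats a year of 0 as missing. The helper names and the sample values are hypothetical.

```python
from collections import defaultdict


def average_by_year(years, scores):
    """Map each known year to the mean score of the films released that year."""
    buckets = defaultdict(list)
    for year, score in zip(years, scores):
        if year != 0:  # 0 marks a missing year
            buckets[year].append(score)
    return {year: sum(vals) / len(vals) for year, vals in buckets.items()}


def last_n_years(averages, n=30):
    """Keep the n most recent years, in chronological order."""
    recent = sorted(averages)[-n:]
    return [(year, averages[year]) for year in recent]


# Hypothetical sample, not the real dataset.
years = [2015, 2015, 2016, 0, 2014]
scores = [8.0, 7.0, 6.5, 9.9, 7.2]
print(last_n_years(average_by_year(years, scores), n=2))
```

The resulting (year, average) pairs can be handed to any plotting library to draw the trend line.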

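Movie types

The 15,000-plus movies were grouped by type, giving the pie chart below.
(To simplify the statistics, types with fewer than 20 movies are grouped under "Other".)
[Figure: pie chart of movies by type]
The market carries a wide variety of movie types. Drama (剧情) is the most common, followed by comedy (喜剧), then action (动作), animation (动画), documentary (纪录片), and crime (犯罪).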
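
The "Other" grouping can be sketched as follows. The function `group_small` and its parameters are illustrative names, not part of the original script. The sample uses a handful of the type counts reported in the output at the end of this article.

```python
def group_small(counts, threshold=20, other_label="Other"):
    """Merge every category with fewer than `threshold` films into one bucket."""
    grouped = {}
    for name, n in counts.items():
        key = name if n >= threshold else other_label
        grouped[key] = grouped.get(key, 0) + n
    return grouped


# A few of the per-type counts from the output at the end of this article.
counts = {"剧情": 5150, "喜剧": 3076, "动作": 1961, "西部": 16, "武侠": 4, "古装": 3}
print(group_small(counts))
```

The grouped dictionary maps directly onto pie-chart labels and slice sizes.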

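Share of good and bad movies

A movie rated 9 or above on Douban counts as good, and one rated below 5 counts as bad.
About 5.73% of the movies in the dataset are rated below 5.
304 movies are rated 9 or above, about 2% of the total.
This suggests that quality across the film industry is uneven, with some titles made mainly to chase box office. It also indicates that viewers apply fairly strict standards when rating.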
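
The two shares can be computed with one small helper. This is a sketch with hypothetical sample scores. The thresholds follow the code at the end of this article (below 5 for bad, 9 or above for good), and `share` is an illustrative name.

```python
def share(scores, predicate):
    """Percentage of scores for which predicate(score) is true."""
    return 100.0 * sum(1 for s in scores if predicate(s)) / len(scores)


# Hypothetical sample, not the real dataset.
scores = [9.2, 4.8, 7.5, 6.0, 3.9, 8.1, 9.0, 7.7, 5.0, 6.6]
bad = share(scores, lambda s: s < 5)    # rated below 5
good = share(scores, lambda s: s >= 9)  # rated 9 or above
print(bad, good)
```

Note that a score of exactly 5.0 is not bad and a score of exactly 9.0 is good under these definitions.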

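Highest- and lowest-rated movies

According to the output, the highest-rated movie is 神秘博士:DT的视频日志:最后的日子 (David Tennant's Video Diary - The Final Days), a documentary rated 9.7.
The lowest-rated movie is 嫁给大山的女人, a drama rated 2.1.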
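
Picking out the extremes does not require sorting the whole list. A minimal sketch using `max` and `min` with a key function is shown here. The rows are hypothetical (title, rating, type) tuples rather than the real data.

```python
# Hypothetical rows in the form (title, rating, type); not the real dataset.
movies = [
    ("Film A", 8.4, "剧情"),
    ("Film B", 2.1, "剧情"),
    ("Film C", 9.7, "纪录片"),
]

best = max(movies, key=lambda m: m[1])   # highest rating
worst = min(movies, key=lambda m: m[1])  # lowest rating
print(best[0], worst[0])
```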

Most-reviewed movie

According to the output, the most-reviewed movie is the same as the highest-rated one: 神秘博士:DT的视频日志:最后的日子.
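
When ranking by review count, keep in mind that `csv.reader` yields every field as a string, so the counts need converting to numbers before they are compared. The sketch below uses hypothetical rows to show how the two comparisons can disagree.

```python
# Hypothetical rows in the form (title, review count as read from the CSV).
rows = [("Film A", "98"), ("Film B", "1200"), ("Film C", "305")]

by_text = max(rows, key=lambda r: r[1])         # lexicographic: "98" ranks highest
by_number = max(rows, key=lambda r: int(r[1]))  # numeric: 1200 ranks highest
print(by_text[0], by_number[0])
```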

Code

# -*- coding: utf-8 -*-
import csv

# Column name -> column index in movie1.csv (spellings such as "rateing" follow the data file).
f = {"IMDB": 0, "title": 1, "rateing": 2, "rank": 3, "is_playable": 4, "types": 5, "type": 6, "regions": 7,
     "release_date": 8, "year": 9, "actor_count": 10, "vote_count": 11, "actors": 12, "typeid": 13,
     "daoyan": 14, "bianju": 15, "runtime": 16,
     "star1": 17, "star2": 18, "star3": 19, "star4": 20, "star5": 21, "reviews": 22}


def isBadMovie(score):
    # A bad movie is one rated below 5.
    return score < 5


def isGoodMovie(score):
    # A good movie is one rated 9 or above.
    return score >= 9


def douban():
    # Read every row of movie1.csv (header included) into a list of lists.
    # Assumption: the file is UTF-8 encoded.
    with open('movie1.csv', newline='', encoding='utf-8') as csvfile:
        return list(csv.reader(csvfile))


def ratingKey(movie):
    # Sort key: the rating as a number, not as text.
    return float(movie[f["rateing"]])


def reviewsKey(movie):
    # Sort key: the review count as a number (assumes the column holds integers).
    return int(movie[f["reviews"]])


def formatMovie(label, movie):
    # One tab-separated result line for a single movie.
    return (label + "\tIMDB:\t" + movie[f["IMDB"]] + "\ttitle:" + movie[f["title"]]
            + "\ttype:" + movie[f["type"]] + "\n")


if __name__ == "__main__":
    movies = douban()
    del movies[0]  # drop the header row

    scores = [float(m[f["rateing"]]) for m in movies]
    types = [m[f["type"]] for m in movies]
    years = [int(m[f["year"]]) for m in movies]

    with open('result.txt', 'w', encoding='utf-8') as fw:
        # 1. Range of release years; a year of 0 marks a missing value.
        validYears = [y for y in years if y != 0]
        minYear = min(validYears)
        maxYear = max(validYears)
        fw.write("min final:" + str(minYear) + "\n")
        fw.write("max final:" + str(maxYear) + "\n")
        fw.write("year period:" + str(maxYear - minYear) + "\n")

        # 2. Average score per year (x: year, y: average score).
        # Group the scores by release year, then average each group.
        scoresByYear = {}
        for year, score in zip(years, scores):
            if year != 0:
                scoresByYear.setdefault(year, []).append(score)
        avgscore = {year: sum(vals) / len(vals) for year, vals in sorted(scoresByYear.items())}
        fw.write(str(avgscore) + "\n")

        # 3. Percentage of bad movies.
        badMovies = list(filter(isBadMovie, scores))
        fw.write("bad movies:" + str(len(badMovies) / len(scores) * 100) + "%\n")

        # 4. Number of movies of each type.
        type_num = {}
        for t in types:
            type_num[t] = type_num.get(t, 0) + 1
        fw.write("num of each type:\n")
        for k, v in type_num.items():
            fw.write(str(k) + ":" + str(v) + "\t")

        # 7. Number of good movies.
        goodMovies = list(filter(isGoodMovie, scores))
        fw.write("\nnum of movies whose score >= 9:" + str(len(goodMovies)) + "\n")

        # 5, 6. Highest- and lowest-rated movies: sort ascending by rating.
        movies.sort(key=ratingKey)
        fw.write(formatMovie("maxscore film :", movies[-1]))
        fw.write(formatMovie("minscore film :", movies[0]))

        # 8. Most-reviewed movie: sort ascending by review count.
        movies.sort(key=reviewsKey)
        fw.write(formatMovie("movie with the most comment people film :", movies[-1]))

Output:

min final:1902

max final:2016

year period:114

Average rating by year:
{1902: 6.2, 1903: 6.2, 1904: 5.0, 1905: 7.6, 1906: 6.7, 1907: 7.5, 1908: 6.3, 1911: 7.2, 1912: 6.2, 1914: 7.4, 1915: 7.8, 1916: 6.4, 1918: 7.9, 1919: 6.7, 1920: 7.0, 1921: 6.4, 1922: 8.0, 1923: 8.2, 1924: 7.6, 1925: 9.0, 1926: 8.3, 1927: 6.9, 1928: 7.5, 1929: 8.2, 1930: 7.0, 1931: 8.2, 1932: 7.9, 1933: 7.7, 1934: 5.3, 1935: 9.1, 1936: 7.4, 1937: 7.0, 1938: 8.0, 1939: 7.9, 1940: 7.9, 1941: 6.3, 1942: 6.6, 1943: 3.8, 1944: 8.0, 1945: 7.9, 1946: 6.8, 1947: 8.8, 1948: 7.6, 1949: 8.1, 1950: 7.8, 1951: 7.7, 1952: 7.5, 1953: 6.7, 1954: 7.7, 1955: 8.2, 1956: 6.8, 1957: 7.4, 1958: 8.7, 1959: 7.4, 1960: 6.7, 1961: 8.6, 1962: 6.9, 1963: 6.8, 1964: 5.4, 1965: 8.1, 1966: 7.8, 1967: 6.4, 1968: 8.7, 1969: 6.4, 1970: 4.9, 1971: 6.7, 1972: 5.7, 1973: 6.7, 1974: 8.0, 1975: 8.3, 1976: 7.6, 1977: 7.1, 1978: 7.1, 1979: 7.0, 1980: 7.5, 1981: 6.4, 1982: 6.4, 1983: 6.8, 1984: 6.2, 1985: 7.7, 1986: 6.1, 1987: 2.8, 1988: 2.8, 1989: 2.8, 1990: 7.6, 1991: 6.4, 1992: 6.4, 1993: 7.3, 1994: 7.9, 1995: 8.8, 1996: 7.0, 1997: 8.3, 1998: 8.5, 1999: 7.4, 2000: 7.7, 2001: 8.3, 2002: 6.5, 2003: 5.8, 2004: 6.6, 2005: 6.6, 2006: 7.1, 2007: 8.7, 2008: 7.4, 2009: 5.4, 2010: 7.4, 2011: 7.9, 2012: 8.3, 2013: 7.8, 2014: 7.9, 2015: 8.3, 2016: 8.2}

bad movies:5.73028586312%

num of each type:
音乐:21    同性:13    传记:327    短片:241    黑色电影:4    运动:7    纪录片:667    西部:16    惊悚:264    科幻:156    家庭:103    历史:44    武侠:4    悬疑:205    Adult:1    冒险:333    战争:38    歌舞:30    喜剧:3076    古装:3    情色:11    爱情:350    恐怖:493    动作:1961    剧情:5150    犯罪:653    动画:1026    奇幻:100    荒诞:1    灾难:1    儿童:57    

num of movies whose score >= 9: 304

maxscore film :    IMDB:    5761    title:神秘博士:DT的视频日志:最后的日子 David Tennant’s Video Diary - The Final Days    type:纪录片

minscore film :    IMDB:    487    title:嫁给大山的女人 type:剧情

movie with the most comment people film :    IMDB:    5761    title:神秘博士:DT的视频日志:最后的日子 David Tennant’s Video Diary - The Final Days    type:纪录片