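Web Scraping in Practice: Douban Movie Top 250
Preparation
Page analysis
On the target page, press Ctrl+U to view the page source (or press F12 to inspect it). Douban's page source appears, and it is very easy to read: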
<!DOCTYPE html>
<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="renderer" content="webkit">
<meta name="referrer" content="always">
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<title>
豆瓣电影 Top 250
</title>
<meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />
<meta http-equiv="Pragma" content="no-cache">
......
Scrolling down to around line 330, we find the following code:
<ol class="grid_view">
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2304569人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
<p>
<span class="gact">
<a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
</span>
<span class="gact">
<a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
</span>
</p>
</div>
</div>
</div>
</li>
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1291546/" class="">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾 / Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 陈凯歌 Kaige Chen 主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
1993 / 中国大陆 中国香港 / 剧情 爱情 同性
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1709666人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
This snippet already contains everything we want to scrape.
Now let's look at the HTML for a single film, starting from its item element:
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2304569人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
<p>
<span class="gact">
<a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
</span>
<span class="gact">
<a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
</span>
</p>
</div>
</div>
</div>
</li>
From the HTML for The Shawshank Redemption (肖申克的救赎), we can see that the content we need to extract is:
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
<span class="rating_num" property="v:average">9.7</span>
<span class="inq">希望让人自由。</span>
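With this analysis done, we can settle on how to implement the code.
Settling on an approach
1. Determine the page URLs:
start_url = 'https://movie.douban.com/top250?start={:d}&filter='
size = 10
for i in range(size):
    url = start_url.format(i * 25)  # url is the link for each page (25 films per page)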
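As a quick check of the paging scheme above, the following sketch builds all ten URLs up front. The helper name build_page_urls and its parameters are hypothetical and not part of the original code; it assumes, as the loop above does, that each page lists 25 films.

```python
START_URL = 'https://movie.douban.com/top250?start={:d}&filter='


def build_page_urls(pages=10, page_size=25):
    """Return the listing URLs; `start` is the offset of the first film on a page."""
    return [START_URL.format(i * page_size) for i in range(pages)]


urls = build_page_urls()
print(urls[0])   # first page, offset 0
print(urls[-1])  # last page, offset 225
```

Building the list first makes it easy to inspect the offsets before any request is sent.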
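2. Fetch each page with requests.get():
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
html = requests.get(url, headers=headers)
The request can fail, and we want to be sure that the data handed to the next step is valid, so we check the status code:
if html.status_code == 200:
    # continue with the next step
    pass
else:
    print("error!!!")
3. The page source returned by get() is sometimes awkward to process directly, so we parse it with BeautifulSoup (although this particular page does not strictly need it):
soup = BeautifulSoup(html.text, 'html.parser')
4. With the previous steps done, we can use regular expressions to extract and clean the data. Since the requirements are already settled, we work through them one at a time.
4.1 Match the URL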
# test is the string being examined in each snippet below
res = r'^[\[a-z<="\s]*href="(.*)">$'
match = re.match(res, test)
if match:
    url = match.group(1)
else:
    url = 'None'
4.2 Match the title
res = r'^[<a-z\s=">]*>(.*)</span>$'
match = re.match(res, test)
if match:
    movie_name = match.group(1)
else:
    movie_name = 'None'
4.3 Match the rating
# the trailing ] is there because test is a stringified list of tags
res = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
match = re.match(res, test)
if match:
    rating = match.group(1)
else:
    rating = 'None'
4.4 Match the tagline
res = r'^[\[=a-z"<>\s]*(.*)</span>]$'
match = re.match(res, test)
if match:
    inq = match.group(1)
else:
    inq = 'None'
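To see this kind of matching run on concrete input, here is a minimal sketch that applies one pattern per field to sample lines copied from the page source shown earlier. The patterns and the extract helper are assumptions made for illustration, not the patterns used in this article: they use re.search with short, non-greedy captures instead of anchoring the whole string.

```python
import re

# Sample lines copied from the page source shown earlier.
samples = {
    'url': '<a href="https://movie.douban.com/subject/1292052/" class="">',
    'title': '<span class="title">肖申克的救赎</span>',
    'rating': '<span class="rating_num" property="v:average">9.7</span>',
    'inq': '<span class="inq">希望让人自由。</span>',
}

# Hypothetical patterns: each captures the text between fixed markers.
patterns = {
    'url': r'href="([^"]*)"',
    'title': r'<span class="title">(.*?)</span>',
    'rating': r'<span class="rating_num"[^>]*>(.*?)</span>',
    'inq': r'<span class="inq">(.*?)</span>',
}


def extract(field, text):
    """Return the captured group for `field`, or 'None' when nothing matches."""
    m = re.search(patterns[field], text)
    return m.group(1) if m else 'None'


result = {field: extract(field, text) for field, text in samples.items()}
print(result)
```

Returning the string 'None' on a miss mirrors the fallback used in the snippets above, so a film without a tagline still produces a complete row.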
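5. After the steps above, the scraped data is available in the program's output, but to make it easier to read and keep, we save it. For convenience, we write it straight to a CSV file:
# res here is the list of scraped rows; newline='' avoids blank lines between rows on Windows
with open('res.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['电影名', '评分', '推荐语', '链接'])  # title, rating, tagline, link
    for i in res:
        writer.writerow(i)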
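The CSV step can be tried without scraping anything. The sketch below writes two sample rows, taken from the page source shown earlier, into an in-memory buffer that stands in for the file. The write_rows helper is hypothetical. The buffer is created with newline='' for the same reason the csv module documentation recommends opening files that way: the writer emits its own line endings.

```python
import csv
import io

# Sample rows in the column order used above: title, rating, tagline, link.
rows = [
    ('肖申克的救赎', '9.7', '希望让人自由。', 'https://movie.douban.com/subject/1292052/'),
    ('霸王别姬', '9.6', '风华绝代。', 'https://movie.douban.com/subject/1291546/'),
]


def write_rows(f, rows):
    """Write the header and then every row to the file-like object `f`."""
    writer = csv.writer(f)
    writer.writerow(['电影名', '评分', '推荐语', '链接'])
    writer.writerows(rows)


buffer = io.StringIO(newline='')  # stand-in for open(..., 'w', newline='')
write_rows(buffer, rows)
lines = buffer.getvalue().splitlines()
print(len(lines))
```

Passing the file object in, rather than a filename, lets the same function write to a real file or to a buffer during testing.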
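Implementation
The preparation stage settled the whole scraping flow, so all that is left is to write the code.
1. Write the getHtmlDiv(url) function, which fetches a page and returns the div tags that hold each film's information: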
import csv
import re

import requests
from bs4 import BeautifulSoup

def getHtmlDiv(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    html = requests.get(url, headers=headers)
    if html.status_code == 200:
        soup = BeautifulSoup(html.text, 'html.parser')
        return soup.find_all(name='div', class_='info')
    else:
        print(html.status_code)
        return []  # no divs on failure, so callers can still iterate
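2. Write the writeToCSV(res, filename) function, which saves the scraped results:
def writeToCSV(res, filename):
    # newline='' avoids blank lines between rows on Windows
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['电影名', '评分', '推荐语', '链接'])  # title, rating, tagline, link
        for i in res:
            writer.writerow(i)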
3. Write the getRes(ans, url) function, which extracts the data we need from the div tags:
def getRes(ans, url):
    div = getHtmlDiv(url)
    for i in range(len(div)):
        # Link: first line of the stringified list of <a> tags
        s_url = str(div[i].find_all(name='a')).split('\n')[0]
        res_url = r'^[\[a-z<="\s]*href="(.*)">$'
        match = re.match(res_url, s_url)
        movie_url = match.group(1) if match else "None"
        # Title: concatenate every title span and the alternative-name span
        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span',class_='other')
        res_title = r'^[<a-z\s=">]*>(.*)</span>$'
        movie_name = ''
        for j in range(len(s_title_span)):
            match = re.match(res_title, str(s_title_span[j]))
            if match:
                m = ''.join(match.group(1).split())  # strip all whitespace
            else:
                m = "None"
            movie_name += m
        # Rating
        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        match = re.match(res_rating, s_rating)
        movie_rating = match.group(1) if match else "None"
        # Tagline
        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        match = re.match(res_inq, s_inq)
        movie_inq = match.group(1) if match else "None"
        item = (movie_name, movie_rating, movie_inq, movie_url)
        ans.append(item)
    return ans
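Since the page is already parsed by BeautifulSoup, the same four fields can also be read through the tag accessors instead of regular expressions. The sketch below is an alternative under stated assumptions, not the method used in this article: parse_info is a hypothetical helper, and SAMPLE is an abbreviated copy of one film entry from the page source shown earlier, so no network request is made.

```python
from bs4 import BeautifulSoup

# One film entry, abbreviated from the page source shown earlier.
SAMPLE = '''
<div class="info">
  <div class="hd">
    <a href="https://movie.douban.com/subject/1292052/" class="">
      <span class="title">肖申克的救赎</span>
      <span class="title"> / The Shawshank Redemption</span>
    </a>
  </div>
  <div class="bd">
    <span class="rating_num" property="v:average">9.7</span>
    <span class="inq">希望让人自由。</span>
  </div>
</div>
'''


def parse_info(div):
    """Build (name, rating, tagline, url) from one <div class="info"> tag."""
    link = div.find('a')
    titles = div.find_all('span', class_='title') + div.find_all('span', class_='other')
    rating = div.find('span', class_='rating_num')
    inq = div.find('span', class_='inq')
    # Strip all whitespace from each title part, as getRes does.
    name = ''.join(''.join(t.get_text().split()) for t in titles)
    return (
        name,
        rating.get_text(strip=True) if rating else 'None',
        inq.get_text(strip=True) if inq else 'None',
        link['href'] if link and link.has_attr('href') else 'None',
    )


soup = BeautifulSoup(SAMPLE, 'html.parser')
item = parse_info(soup.find('div', class_='info'))
print(item)
```

Reading href and the span text directly does not depend on how a tag is rendered back to a string, which is the part of the regex approach most likely to break if the markup changes.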
4. Write the main() function:
def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="
    size = 10
    res = []
    for i in range(size):
        try:
            url = start_url.format(i * 25)
            res = getRes(res, url)
        except Exception:  # skip a page that fails, but let Ctrl+C through
            print("main() error! i = %d" % i)
            continue
    writeToCSV(res, 'test.csv')

if __name__ == '__main__':
    main()
Checking the results
In the folder that contains the code, a .csv file has now been generated; it holds the scraped results.