Web Scraping in Practice: Douban Movie Top 250



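Preparation

Page analysis

On the target page, press Ctrl+U to view the page source (or F12 to inspect it). Douban serves the full HTML directly, which makes it very friendly to scrape: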

<!DOCTYPE html>
<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="renderer" content="webkit">
    <meta name="referrer" content="always">
    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
    <title>
豆瓣电影 Top 250
</title>
    
    <meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />
    <meta http-equiv="Pragma" content="no-cache">
    ......

Scrolling down to around line 330 of the source, we find the following markup:

<ol class="grid_view">
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>2304569人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>
                            

    <p>
        
        <span class="gact">
            <a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
        </span>&nbsp;&nbsp;
        
        <span class="gact">
            <a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
        </span>&nbsp;&nbsp;
    </p>

                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">2</em>
                    <a href="https://movie.douban.com/subject/1291546/">
                        <img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1291546/" class="">
                            <span class="title">霸王别姬</span>
                                <span class="other">&nbsp;/&nbsp;再见,我的妾  /  Farewell My Concubine</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 陈凯歌 Kaige Chen&nbsp;&nbsp;&nbsp;主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
                            1993&nbsp;/&nbsp;中国大陆 中国香港&nbsp;/&nbsp;剧情 爱情 同性
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.6</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1709666人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">风华绝代。</span>
                            </p>

Everything we want to scrape is contained in this markup.

Now let's look at the HTML for a single movie, starting from its item element:

<div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>2304569人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>
                            

    <p>
        
        <span class="gact">
            <a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
        </span>&nbsp;&nbsp;
        
        <span class="gact">
            <a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
        </span>&nbsp;&nbsp;
    </p>

                    </div>
                </div>
            </div>
        </li>

From the HTML for The Shawshank Redemption (肖申克的救赎), we can see that the content we need to extract is:

<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>    
<span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>    
<span class="rating_num" property="v:average">9.7</span>    
<span class="inq">希望让人自由。</span>    

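With this analysis done, we can settle on how to implement the code.

Deciding on the approach

1. Determine the page URLs. The list is split into 10 pages of 25 movies, and the start parameter is the offset of the first movie on each page: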

start_url = 'https://movie.douban.com/top250?start={:d}&filter='
size = 10
for i in range(size):
    url = start_url.format(i * 25)  # url is the link for page i (offsets 0, 25, ..., 225)
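
As a quick check of the URL scheme above, here is a minimal sketch that builds all ten page URLs up front. The helper name page_urls and its parameters are made up for illustration and are not part of the original script:

```python
# Sketch only: page_urls is a hypothetical helper, not part of the original code.
START_URL = 'https://movie.douban.com/top250?start={:d}&filter='


def page_urls(pages=10, per_page=25):
    """Return one URL per result page; `start` is the offset of the first movie."""
    return [START_URL.format(i * per_page) for i in range(pages)]


urls = page_urls()
print(urls[0])   # first page, offset 0
print(urls[-1])  # last page, offset 225
```

Building the list first makes it easy to confirm that the offsets cover all 250 movies before any request is sent.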

2. Fetch each page with requests.get(), sending a browser-like User-Agent header:

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
html = requests.get(url, headers=headers)

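Fetching a page can fail, and we want to make sure that only valid data is passed on to the next step, so we check the status code first:

if html.status_code == 200:
    # continue with the next step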
    pass
else:
    print("error!!!")

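3. The page source returned by get() is sometimes awkward to process directly, so we use BeautifulSoup to parse it (although this particular page does not really need it):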

soup = BeautifulSoup(html.text, 'html.parser')

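4. With the previous steps done, we can use regular expressions to clean the data. The requirements were fixed earlier, so we work through them one at a time. In the snippets below, test stands for the string form of the tag being matched.

4.1 Matching the URL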

res = r'^[\[a-z<="\s]*href="(.*)">$'
if re.match(res, test):
    url = re.match(res, test).group(1)
else:
    url = 'None'

4.2 Matching the title

res = r'^[<a-z\s=">]*>(.*)</span>$'
if re.match(res, test):
    movie_name = re.match(res, test).group(1)
else:
    movie_name = 'None'

4.3 Matching the rating

res = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
if re.match(res, test):
    rating = re.match(res, test).group(1)
else:
    rating = 'None'

4.4 Matching the tagline

res = r'^[\[=a-z"<>\s]*(.*)</span>]$'
if re.match(res, test):
    inq = re.match(res, test).group(1)
else:
    inq = 'None'
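
To see what the four patterns above actually capture, here is a small sketch that runs them on hand-written sample strings. The sample strings are assumptions: they imitate how BeautifulSoup prints a tag (or a one-element list of tags), with attributes in alphabetical order. The helper first_group is hypothetical and only wraps the match-or-'None' logic used above.

```python
import re


def first_group(pattern, text):
    """Return the first captured group, or the string 'None' if nothing matches."""
    m = re.match(pattern, text)
    return m.group(1) if m else 'None'


# Patterns copied from steps 4.1 to 4.4
URL_RE = r'^[\[a-z<="\s]*href="(.*)">$'
NAME_RE = r'^[<a-z\s=">]*>(.*)</span>$'
RATING_RE = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
INQ_RE = r'^[\[=a-z"<>\s]*(.*)</span>]$'

# Assumed inputs, written by hand to resemble stringified tags
url_text = '[<a class="" href="https://movie.douban.com/subject/1292052/">'
name_text = '<span class="title">肖申克的救赎</span>'
rating_text = '[<span class="rating_num" property="v:average">9.7</span>]'
inq_text = '[<span class="inq">希望让人自由。</span>]'

movie_url = first_group(URL_RE, url_text)
movie_name = first_group(NAME_RE, name_text)
rating = first_group(RATING_RE, rating_text)
inq = first_group(INQ_RE, inq_text)
missing_inq = first_group(INQ_RE, '[]')  # a movie without a tagline

print(movie_url, movie_name, rating, inq, missing_inq)
```

Note that the URL pattern expects href to be the last attribute in the tag. If the attribute order in the input differs, the captured group will contain more than the link.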

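5. At this point the program can already print the scraped data, but to make the results easier to read and keep, we save them to a file. For convenience we write them straight to CSV (here res is the list of result rows, not a regex pattern). Passing newline='' stops the csv module from producing blank lines between rows on Windows:

with open('res.csv', 'w', encoding='utf-8', newline='') as f: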
    writer = csv.writer(f)
    writer.writerow(['电影名', '评分', '推荐语', '链接'])
    for i in res:
        writer.writerow(i)
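
The CSV step can be tried without touching the disk. The sketch below writes one sample row to an in-memory buffer and parses it back. The function name rows_to_csv_text and the sample row are assumptions made for illustration:

```python
import csv
import io


def rows_to_csv_text(rows):
    """Write the header plus the given rows to an in-memory buffer and return the text."""
    buf = io.StringIO(newline='')  # newline='' leaves the writer's line endings untouched
    writer = csv.writer(buf)
    writer.writerow(['电影名', '评分', '推荐语', '链接'])  # title, rating, tagline, link
    writer.writerows(rows)
    return buf.getvalue()


sample = [('肖申克的救赎', '9.7', '希望让人自由。',
           'https://movie.douban.com/subject/1292052/')]
text = rows_to_csv_text(sample)

# Parse the text back to confirm it round-trips
parsed = list(csv.reader(io.StringIO(text, newline='')))
print(parsed)
```

The same newline='' argument applies when opening a real file for the csv module, which is what the csv documentation recommends.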

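Implementation

The preparation stage fixed the whole scraping workflow, so from here on we just need to be a merciless coding machine.

1. Write getHtmlDiv(url), which fetches a page and parses it to get the matching div tags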

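import csv
import re
import requests
from bs4 import BeautifulSoup

def getHtmlDiv(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    html = requests.get(url, headers=headers)
    if html.status_code == 200:
        soup = BeautifulSoup(html.text, 'html.parser')
        # Each movie's details live in a <div class="info"> element
        return soup.find_all(name='div', class_='info')
    else:
        print(html.status_code)
        return []  # empty result so the caller can still iterate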

2. Write writeToCSV(res, filename), which saves the scraped results

def writeToCSV(res, filename):
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['电影名', '评分', '推荐语', '链接'])
        for i in res:
            writer.writerow(i)

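3. Write getRes(ans, url), which extracts the data we need from the div tags and appends one (title, rating, tagline, link) tuple per movie to ans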

def getRes(ans, url):
    div = getHtmlDiv(url)
    for i in range(len(div)):
        s_url = str(div[i].find_all(name='a')).split('\n')[0]
        res_url = r'^[\[a-z<="\s]*href="(.*)">$'
        if re.match(res_url, s_url):
            movie_url = re.match(res_url, s_url).group(1)
        else:
            movie_url = "None"

        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span',class_='other')
        res_title = r'^[<a-z\s=">]*>(.*)</span>$'
        movie_name = ''
        for j in range(len(s_title_span)):
            if re.match(res_title, str(s_title_span[j])):
                m = re.match(res_title, str(s_title_span[j])).group(1)
                m = ''.join(m.split())
            else:
                m = "None"
            movie_name += m

        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        if re.match(res_rating, s_rating):
            movie_rating = re.match(res_rating, s_rating).group(1)
        else:
            movie_rating = "None"


        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        if re.match(res_inq, s_inq):
            movie_inq = re.match(res_inq, s_inq).group(1)
        else:
            movie_inq = "None"


        item = (movie_name, movie_rating, movie_inq, movie_url)
        ans.append(item)
    return ans
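
One detail of the title handling is worth seeing in isolation: ''.join(m.split()) removes every whitespace character, including the non-breaking spaces that &nbsp; turns into, so the spaces inside an English title disappear as well. The sketch below reproduces that step on sample strings. The inputs are assumptions (the span texts with &nbsp; already decoded to '\xa0'), and clean_title_keep_spaces is a hypothetical variant, not something the original code does:

```python
def clean_title(parts):
    """Mirror the original step: drop all whitespace from each part, then concatenate."""
    return ''.join(''.join(p.split()) for p in parts)


def clean_title_keep_spaces(parts):
    """Hypothetical variant: collapse whitespace runs to single spaces instead."""
    return ''.join(' '.join(p.split()) for p in parts)


# Assumed span texts for the first movie, with &nbsp; decoded to '\xa0'
parts = [
    '肖申克的救赎',
    '\xa0/\xa0The Shawshank Redemption',
    '\xa0/\xa0月黑高飞(港)  /  刺激1995(台)',
]

name = clean_title(parts)
kept = clean_title_keep_spaces(parts)
print(name)
print(kept)
```

Whether to keep the spaces is a matter of taste. The original choice gives compact titles at the cost of running English words together.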

4. Write the main() function

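def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="
    size = 10  # 10 pages of 25 movies each
    res = []
    for i in range(size):
        try:
            url = start_url.format(i * 25)
            res = getRes(res, url)
        except Exception as e:
            # Report the failing page and move on to the next one
            print("main() error! i = %d (%s)" % (i, e))
            continue
    writeToCSV(res, 'test.csv')

if __name__ == '__main__':
    main()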

Checking the result

After the script runs, a .csv file (test.csv) appears in the folder that contains the code. This file holds the scraped results.


Author: Amonologue
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Amonologue when reposting.