Some notes on Scrapy.

Scrapy is a fast web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses: data mining, monitoring, automated testing, and more. Below, we'll get familiar with Scrapy by scraping the Douban movie Top 250 chart.

Scrapy's official introduction reads:

	An open source and collaborative framework for extracting the data you need from websites.
	In a fast, simple, yet extensible way.

First, we create a project with Scrapy:

	scrapy startproject topmovie

Scrapy then generates the basic structure of our project.

A Scrapy project's file structure generally looks like this:

	scrapy.cfg
	myproject/
	    __init__.py
	    items.py
	    pipelines.py
	    settings.py
	    spiders/
	        __init__.py
	        ...

items.py defines the data we want to scrape and later process. Here's our items.py:

	from scrapy import Item, Field

	class TopmovieItem(Item):
	    title = Field()
	    movieInfo = Field()
	    star = Field()
	    quote = Field()

An Item is the container we use to hold the scraped data.

A field like title works like a dict "key", and the scraped data is the corresponding "value".

From the official manual:

	Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
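
A quick sketch of that dict-like API in action (the field names are the ones from our TopmovieItem above; the title string is just an example):

	from topmovie.items import TopmovieItem

	item = TopmovieItem()
	item['title'] = 'Example Movie'   # set a field like a dict key
	print(item['title'])              # read it back the same way
	print(item.get('quote', ''))      # .get() with a default also works
	# item['director'] = 'x'          # KeyError: this field was never declared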

settings.py configures Scrapy: change the user agent, set the delay between requests, configure proxies, enable various middlewares, and so on.

	The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.

For example:

	BOT_NAME = 'topmovie'

	SPIDER_MODULES = ['topmovie.spiders']
	NEWSPIDER_MODULE = 'topmovie.spiders'

	# Crawl responsibly by identifying yourself (and your website) on the user-agent
	USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36'

	# Create a CSV file and write the scraped data into it
	FEED_URI = '/Users/puronglong/Desktop/result.csv'
	FEED_FORMAT = 'csv'
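
A few other settings that often get tweaked here. This is a sketch with assumed values, and the ProxyMiddleware path is a made-up placeholder; note also that newer Scrapy releases fold FEED_URI/FEED_FORMAT into a single FEEDS dict:

	# Be polite: wait between requests
	DOWNLOAD_DELAY = 2

	# Whether to honor robots.txt
	ROBOTSTXT_OBEY = True

	# Hook in downloader middlewares (this class path is hypothetical)
	DOWNLOADER_MIDDLEWARES = {
	    'topmovie.middlewares.ProxyMiddleware': 543,
	}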

pipelines.py holds the post-processing logic for the scraped data, keeping the crawling and the processing separate.

	After an item has been scraped by a spider, it is sent to the Item Pipeline which process it through several components that are executed sequentially.

We don't do any processing here, so we keep the default:

	class TopmoviePipeline(object):
	    def process_item(self, item, spider):
	        return item
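
If we did want post-processing, a pipeline could look like this sketch: it drops movies that lack a quote (DropItem is Scrapy's standard way to discard an item; the class name here is made up):

	from scrapy.exceptions import DropItem

	class RequireQuotePipeline(object):
	    # Hypothetical pipeline: throw away items that have no quote
	    def process_item(self, item, spider):
	        if not item['quote']:
	            raise DropItem('missing quote: %s' % item['title'])
	        return item

A pipeline only runs once it's registered in settings.py, e.g. ITEM_PIPELINES = {'topmovie.pipelines.RequireQuotePipeline': 300}.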

With the file structure covered, we write our spider code under the spiders directory:

	import scrapy
	from scrapy.spiders import CrawlSpider
	from scrapy.http import Request
	from scrapy.selector import Selector
	from topmovie.items import TopmovieItem

	class Douban(CrawlSpider):

	    # 'name' is the value you pass to `scrapy crawl <name>`
	    name = 'topmovie'

	    # The spider's default entry URL
	    start_urls = ['http://movie.douban.com/top250']
	    url = 'http://movie.douban.com/top250'

	    def parse(self, response):
	        selector = Selector(response)

	        # Each movie's details sit in a div with class "info"
	        Movies = selector.xpath('//div[@class="info"]')

	        for eachMovie in Movies:
	            # Build a fresh item per movie; reusing a single item object
	            # across iterations would make every yield share the same state
	            item = TopmovieItem()

	            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
	            fullTitle = ''.join(title)

	            MovieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
	            star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
	            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()

	            # quote (the movie's one-line tagline) may be missing
	            quote = quote[0] if quote else ''

	            # Copy the extracted values into our item
	            item['title'] = fullTitle
	            item['movieInfo'] = MovieInfo[0] + '\n' + MovieInfo[1]
	            item['star'] = star
	            item['quote'] = quote

	            # yield turns parse() into a generator of items
	            yield item

	        # Grab the link to the next page, if there is one
	        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()

	        if nextLink:
	            nextLink = nextLink[0]
	            print(nextLink)

	            # Request the next page, with this same method as callback
	            yield Request(self.url + nextLink, callback=self.parse)
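
To try the XPath expressions interactively before a full run, and then launch the spider under the name we declared:

	# Poke at the selectors in an interactive shell first
	scrapy shell 'http://movie.douban.com/top250'
	>>> response.xpath('//div[@class="info"]')

	# Then run the full crawl
	scrapy crawl topmovie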

yield is a keyword similar to return, except that a function using it returns a generator.
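
A tiny standalone illustration:

	def count_up():
	    yield 1
	    yield 2

	gen = count_up()   # calling the function runs nothing yet
	for n in gen:      # each iteration resumes the function at the last yield
	    print(n)       # prints 1, then 2

Scrapy consumes parse() the same way, iterating over whatever items and Requests it yields.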

Let's look at the crawl results:

All scraped. Nice~