Create the CrawlSpider spider file:
scrapy genspider -t crawl <spider_name> <domain_to_crawl>
scrapy genspider -t crawl read https://www.dushu.com/book/1206.html
LinkExtractor is the link extractor: through it, the Spider knows which links to extract from a crawled page, and each extracted link automatically becomes a Request object.
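For example, a LinkExtractor can select links by regular expression or by page region (a brief illustrative sketch; these particular patterns are examples, not part of this project):

from scrapy.linkextractors import LinkExtractor

LinkExtractor(allow=r"/book/1206_\d+\.html")            # match hrefs against a regex
LinkExtractor(restrict_xpaths='//div[@class="pages"]')  # only links inside this XPath region
LinkExtractor(restrict_css=".pages")                    # same idea, with a CSS selector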
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# Adjust the module path below to match your project's items module.
from scrapy_readbook_41.items import ScarpyReadbook41Item

class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206_1.html"]

    # The LinkExtractor tells the spider which links to pull from each crawled
    # page; every extracted link automatically becomes a Request object.
    rules = (
        Rule(LinkExtractor(allow=r"/book/1206_\d+\.html"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            src = src_list[i].extract()
            book = ScarpyReadbook41Item(name=name, src=src)
            yield book
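For reference, a minimal items.py matching the two fields used above (the class name comes from the spider code; the field definitions are a sketch inferred from the name and src keys):

import scrapy

class ScarpyReadbook41Item(scrapy.Item):
    name = scrapy.Field()  # book title, taken from the img alt attribute
    src = scrapy.Field()   # cover image URL, taken from data-original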
Enable the pipeline in settings.py:
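A minimal sketch of the settings.py entry, assuming the project module is named scrapy_readbook_41 (adjust the dotted path and priority to your project):

ITEM_PIPELINES = {
    "scrapy_readbook_41.pipelines.ScarpyReadbook41Pipeline": 300,
}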
Then write the items to a file in the pipeline:
class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
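Note that str(item) writes Python repr text, not valid JSON, even though the file is named books.json. A minimal variant (my sketch, not from the original post) that emits one JSON object per line:

import json

class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works because scrapy.Item implements the mapping protocol
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()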
After running the spider, the first page's data is missing. The start URL needs the _1 suffix, otherwise the first page is never read: the rule's allow pattern r"/book/1206_\d+\.html" does not match the bare /book/1206.html, so that page's books never reach parse_item.

start_urls = ["https://www.dushu.com/book/1206_1.html"]
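To run the spider, use the standard Scrapy command with the spider's name attribute:

scrapy crawl read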
That concludes this walkthrough of crawling dushu.com with Scrapy's CrawlSpider.