Scrapy Learning Notes (Practice)

Published 2020-05-26 | Category: Python

Project source code (kept in sync with this blog)
Recommended reading

1. A blog built on Halo with the Vno theme

The code crawls the title and URL of every article on the home page.

Core code

import scrapy


class itest(scrapy.Spider):
    name = "itest"
    start_urls = ['https://againriver.com']

    def parse(self, response):
        # Each post on the home page is an <li> inside <ol class="post-list">
        source = response.css("ol.post-list>li")
        results = self.getInfo(source)
        self.log("Result: %s" % results)

    def getInfo(self, source):
        results = []
        for i in source:
            item = {}
            # Title and link come from the <a> inside the post title <h2>
            item["title"] = i.css("h2.post-list__post-title>a::attr(title)").extract()
            item["href"] = i.css("h2.post-list__post-title>a::attr(href)").extract()
            results.append(item)
        return results
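
Inside a Scrapy project this spider is run with the scrapy crawl command. If you want to run it as a standalone script instead, here is a minimal sketch using scrapy.crawler.CrawlerProcess (it assumes the itest class above is defined in the same file):

from scrapy.crawler import CrawlerProcess

# Minimal sketch: run the spider outside a full Scrapy project.
# Assumes the itest class defined above is available in this module.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(itest)
process.start()  # blocks until the crawl finishes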

2. A follow-up exercise extending Example 1

Adds pagination, follows each article to its detail page, and saves the article content to a local file.

import scrapy


class itest(scrapy.Spider):
    name = "itest"
    start_urls = ['https://againriver.com/#blog']

    def parse(self, response):
        source = response.css("ol.post-list>li")
        for i in source:
            item = {}
            # Title
            item["title"] = i.css("h2.post-list__post-title>a::attr(title)").extract_first()
            # Link
            item["href"] = i.css("h2.post-list__post-title>a::attr(href)").extract_first()
            # Follow the link to the article detail page, carrying the item along via meta
            yield scrapy.Request(url=str(item["href"]), meta={"item": item}, callback=self.parseContent)
        # Pagination: follow the "older posts" link if there is one
        next_page = response.css('a.pagination__older::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    # Article detail page
    def parseContent(self, response):
        item = response.meta["item"]
        item["content"] = response.css("article.post-container>section.post").extract_first()
        self.log("Parsed item: %s" % item)
        # Save the article HTML to a local file named after the title
        filename = str(item["title"])
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(str(item["content"]))
        self.log("Saved file: %s" % filename)
        return item
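
Because parseContent also returns the collected dict, Scrapy's feed exports can capture every article as an item in addition to the files written above. A minimal sketch, assuming a Scrapy version recent enough to support the FEEDS setting (on older versions, pass -o articles.json on the command line instead):

# settings.py - export the dicts returned by parseContent as a JSON file.
# FEEDS requires a newer Scrapy release; on older versions run:
#   scrapy crawl itest -o articles.json
FEEDS = {
    "articles.json": {
        "format": "json",
        "encoding": "utf8",
    },
}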

3. Adding a proxy pool

Built on Scrapy's own downloader middleware mechanism.

# middlewares.py - add a proxy middleware
import requests


class HttpbinProxyMiddleware(object):
    def process_request(self, request, spider):
        # http://localhost:5010/get/ is an open-source proxy IP service (you need to run it yourself)
        # Source: https://github.com/jhao104/proxy_pool
        pro_addr = requests.get('http://localhost:5010/get/').json()["proxy"]
        request.meta['proxy'] = 'https://' + pro_addr

# settings.py - register the middleware and set its priority
DOWNLOADER_MIDDLEWARES = {
   'itest.middlewares.HttpbinProxyMiddleware': 543,
}
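
A quick way to confirm that traffic really goes out through the proxy is to request a service that echoes the caller's IP. The sketch below is not from the original post; the spider name is illustrative and it uses httpbin.org/ip as the echo service:

import scrapy


class ProxyCheckSpider(scrapy.Spider):
    # Illustrative spider for verifying the proxy middleware.
    name = "proxy_check"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        # With the middleware enabled, the reported origin should be the
        # proxy's IP rather than your own.
        self.log("Origin IP: %s" % response.text)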

Official documentation for downloader middleware


  • Author: 疯子虾夫
  • Permalink: https://hefengwei.com/archives/1590485396
  • Copyright: unless stated otherwise, all posts on this blog are licensed under CC BY-NC-SA 3.0. Please credit the source when reposting!
# web scraping