Scraping MOOC Course Information with Python
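MOOC Course Information Scraping
- Date: 2019-10-12
I. Task and Goals
- Target site: http://www.imooc.com/course/list/
- Framework: Scrapy
- Data to scrape: course title, course image URL, and number of learners; as a last step, download the course images.
- Output format: JSON
- Completeness: scrape the information for all courses.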
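II. Preparing and Installing the Prerequisites
- Python version: Python 3.7.4 (downloaded from the official website)
- Operating system: Windows 10
Installing the Scrapy framework
Open cmd and enter:
pip install scrapy
If the installation fails, try upgrading pip first:
python -m pip install -U pip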
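To confirm that the installation worked, a small check like the following can be run with the same interpreter. It is only a sketch: the helper name `is_installed` is made up for this example, and it relies on nothing but the standard library plus Scrapy's own `__version__` attribute.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    if is_installed('scrapy'):
        import scrapy
        print('Scrapy version:', scrapy.__version__)
    else:
        print('Scrapy not found; try: pip install scrapy')
```

If the script reports that Scrapy is missing even though pip succeeded, pip and python are probably pointing at different Python installations.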
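III. Setting Up the Crawler Project
Note: Scrapy projects are created and run from the command line.
Step 1: Search for cmd on your computer, open the command prompt, and enter:
cd <the folder where you want to keep the crawler>
For example: cd C:\Users\zhong\Desktop\scrapy\mooc
Step 2: Create the Scrapy project. In cmd, enter:
scrapy startproject mooc
mooc is the project name.
Change into the mooc directory:
cd mooc
Step 3: Create our first spider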
scrapy genspider moocscrapy imooc.com
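Open the project folder and run tree /f in cmd to print the directory tree. The main entries are:
- scrapy.cfg: the project's configuration file.
- mooc/: the project's Python module. The code is added here.
- mooc/items.py: the project's items file.
- mooc/pipelines.py: the project's pipelines file.
- mooc/settings.py: the project's settings file.
- mooc/spiders/: the directory that holds the spider code.
That is the file layout of our crawler project.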
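IV. Designing the Spider and Scraping the Data
Writing the code in an editor is recommended (PyCharm is a good choice).
Note:
- name: identifies the spider. It must be unique; two different spiders cannot share the same name.
- start_urls: the list of URLs the spider starts crawling from. The first pages fetched are therefore taken from this list; later URLs are extracted from the data retrieved from these initial pages.
- parse(): a method of the spider. When it is called, the Response object produced by downloading each start URL is passed in as its only argument. The method parses the response data, extracts data (producing items), and generates Request objects for URLs that need further processing.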
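The contract described above (a callback that yields both items and further requests) can be illustrated without Scrapy. The following is a toy model only, not Scrapy's engine: `FAKE_SITE`, `FollowUp`, and `crawl` are invented for this sketch, and the "responses" are plain tuples instead of downloaded pages.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (course titles, next URL or None)
FAKE_SITE = {
    'page1': (['Course A', 'Course B'], 'page2'),
    'page2': (['Course C'], None),
}


class FollowUp:
    """Plays the role of a request: a URL plus the callback to run on it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(response):
    titles, next_url = response
    for title in titles:
        yield {'title': title}           # an extracted item
    if next_url is not None:
        yield FollowUp(next_url, parse)  # a further URL to crawl


def crawl(start_urls):
    """Fetch each queued URL, run its callback, and separate items from requests."""
    queue = deque(FollowUp(url, parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        response = FAKE_SITE[request.url]  # "download" the page
        for result in request.callback(response):
            if isinstance(result, FollowUp):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == '__main__':
    print(crawl(['page1']))
```

The point of the sketch is that parse() does not need to know who called it: everything it yields is either data to keep or another page to visit.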
Step 1: Create a container for the scraped data. Open items.py:
import scrapy


class MoocItem(scrapy.Item):
    # Course title
    title = scrapy.Field()
    # Course URL
    url = scrapy.Field()
    # Course cover image URL
    image_url = scrapy.Field()
    # Course description
    introduction = scrapy.Field()
    # Number of learners
    student = scrapy.Field()
    # Image path
    image_path = scrapy.Field()
Step 2: Write the moocscrapy.py file
# -*- coding: utf-8 -*-
import scrapy

from mooc.items import MoocItem


class MoocscrapySpider(scrapy.Spider):
    name = 'moocscrapy'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list/']

    def parse(self, response):
        # Select the <a> element of every course card
        for box in response.xpath('//div[@class="course-card-container"]/a[@target="_blank"]'):
            # Create a fresh item per course so yielded items do not share state
            item = MoocItem()
            # Course URL
            item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
            # Course title
            item['title'] = box.xpath('.//div[@class="course-card-content"]/h3[@class="course-card-name"]//text()').extract()[0]
            # Cover image URL
            item['image_url'] = "http:" + box.xpath('.//img[@class="course-banner lazy"]/@data-original').extract()[0]
            # Number of learners
            item['student'] = box.xpath('.//span[2]/text()').extract()[0]
            # Course description
            item['introduction'] = box.xpath('.//p/text()').extract()[0].strip()
            # Hand the item over to Scrapy
            yield item
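The spider builds absolute URLs by string concatenation ('http://www.imooc.com' + href and "http:" + image source). As a hedged alternative, the standard library's `urljoin` resolves both kinds of relative reference against the page URL. The attribute values below are hypothetical and only mimic the shapes the spider handles.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.imooc.com/course/list/'

# Hypothetical attribute values, shaped like the ones the spider concatenates
course_href = '/learn/1'                   # site-relative path
image_src = '//img.example.com/cover.jpg'  # protocol-relative URL

course_url = urljoin(PAGE_URL, course_href)
image_url = urljoin(PAGE_URL, image_src)

print(course_url)  # http://www.imooc.com/learn/1
print(image_url)   # http://img.example.com/cover.jpg
```

Inside a spider, Scrapy's `response.urljoin()` does the same resolution against the URL of the current response.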
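Note: XPath is used here to extract information from the page.
- In parse(), the response argument holds the downloaded page; XPath expressions are then used to locate the data we need.
- After completing the steps above, run the spider to check for errors. From the command line, change into the project folder and run:
scrapy crawl moocscrapy
If nothing is wrong, the scraped items appear in the console log.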
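The pattern used in parse(), an outer query that selects each course card followed by inner queries that start with "." and search only inside that card, can be tried offline. The sketch below uses the standard library's ElementTree rather than Scrapy's selectors, so that it runs without a network connection. ElementTree supports only a subset of XPath (no text() or @attribute steps), so the text and the attribute are read through `.text` and `.get()` instead. The HTML fragment is hand-written and its content is made up; only the class names mirror the spider.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment that imitates one course card.
FRAGMENT = """
<html><body>
  <div class="course-card-container">
    <a target="_blank" href="/learn/1">
      <div class="course-card-content">
        <h3 class="course-card-name">Sample course</h3>
        <p>  A short description.  </p>
      </div>
    </a>
  </div>
</body></html>
"""

root = ET.fromstring(FRAGMENT)

courses = []
# Outer query: one <a> element per course card
for box in root.findall('.//div[@class="course-card-container"]/a[@target="_blank"]'):
    # Inner queries start with "." so they only search inside this card
    title = box.find('.//h3[@class="course-card-name"]').text
    intro = box.find('.//p').text.strip()
    courses.append({'url': box.get('href'), 'title': title, 'introduction': intro})

print(courses)
```

Real pages are rarely well-formed XML, which is one reason to rely on Scrapy's selectors for the actual crawl and keep this kind of snippet for experimenting with the query structure.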
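Step 3: Saving the scraped data
We are now halfway there. Next, the item data has to be written to a file for storage.
Open pipelines.py, which is where data processing and storage are handled.
For simplicity, the scraped data is written to the file data.json.
The code is as follows:
import json


class MoocPipeline(object):
    # Called when the spider is opened: open the output file
    def open_spider(self, spider):
        self.file = open('data.json', 'w', encoding='utf-8')

    # Called for every item the spider yields
    def process_item(self, item, spider):
        # Serialize the item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write it to the file
        self.file.write(line)
        # Return the item so that later pipelines can process it
        return item

    # Called when the spider is closed: close the file so buffered data is flushed
    def close_spider(self, spider):
        self.file.close()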
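Because process_item writes one JSON object per line, data.json ends up in the JSON Lines layout rather than as a single JSON document. A minimal sketch of reading such a file back is shown below. The helper names and the sample records are made up for illustration, and the file is written to a temporary directory so that the real output is not touched.

```python
import json
import os
import tempfile


def write_items(path, items):
    """Write one JSON object per line, as the pipeline's process_item does."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(path):
    """Parse the file line by line, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == '__main__':
    # Made-up sample records, only to exercise the round trip
    sample = [
        {'title': '课程一', 'student': '100'},
        {'title': 'Course two', 'student': '200'},
    ]
    path = os.path.join(tempfile.mkdtemp(), 'data.json')
    write_items(path, sample)
    print(read_items(path) == sample)
```

Calling json.load() on the whole file would raise an error as soon as it holds more than one record, which is why the reader goes line by line.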
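To use a pipeline, it first has to be registered.
Open settings.py, which is the crawler's configuration file, and add:
ITEM_PIPELINES = {
    'mooc.pipelines.MoocPipeline': 1,
}
The code above registers the pipeline. mooc.pipelines.MoocPipeline is the class being registered, and the 1 on the right is its priority. Priorities are customarily chosen from the 0 to 1000 range, and pipelines with lower values run first.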
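The ordering rule only matters once more than one pipeline is registered. The sketch below mirrors the rule with a plain sort; it is not Scrapy's internal code. Only mooc.pipelines.MoocPipeline exists in this project; the other two entries and the helper `execution_order` are hypothetical.

```python
# 'mooc.pipelines.MoocPipeline' is from this project; the other two entries
# are hypothetical and exist only to show the ordering.
ITEM_PIPELINES = {
    'mooc.pipelines.MoocPipeline': 300,
    'mooc.pipelines.CleanTextPipeline': 100,       # hypothetical
    'mooc.pipelines.DropDuplicatesPipeline': 200,  # hypothetical
}


def execution_order(pipelines):
    """Return pipeline paths sorted by priority, lowest number first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == '__main__':
    for path in execution_order(ITEM_PIPELINES):
        print(path)
```

With these numbers, an item would be cleaned first, checked for duplicates second, and written to the file last.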
With that, the most basic scraping workflow is complete.
Now run the spider again:
scrapy crawl moocscrapy
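Step 4: Following URLs
The course list has 29 pages in total, and the page URLs follow the pattern http://www.imooc.com/course/list/?page=N. parse() can therefore keep a page counter n (starting at 1) and request the next page until page 29 has been requested:
        if self.n < 29:
            self.n = self.n + 1
            newurl = 'http://www.imooc.com/course/list/?page=' + str(self.n)
            yield scrapy.Request(newurl, callback=self.parse)
Then run the spider again and check the results.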
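The counter logic is easy to get off by one, so it can help to enumerate the follow-up URLs outside the spider. The sketch below assumes the 29 pages mentioned above (the real count will change over time); the function names are made up for this example.

```python
BASE_URL = 'http://www.imooc.com/course/list/'
LAST_PAGE = 29  # page count reported in the text; an assumption that will age


def next_page_url(current_page, last_page=LAST_PAGE):
    """Return the URL of the page after current_page, or None when done."""
    if current_page < last_page:
        return BASE_URL + '?page=' + str(current_page + 1)
    return None


def all_follow_up_urls(last_page=LAST_PAGE):
    """Walk the counter the way the spider does, starting from page 1."""
    urls = []
    page = 1
    while True:
        url = next_page_url(page, last_page)
        if url is None:
            break
        urls.append(url)
        page += 1
    return urls


if __name__ == '__main__':
    urls = all_follow_up_urls()
    print(len(urls))
    print(urls[0])
    print(urls[-1])
```

Page 1 is covered by start_urls, so 28 follow-up requests (pages 2 through 29) are needed to reach all 29 pages.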
Full source code for the above:
mooc/spiders/moocscrapy.py
# -*- coding: utf-8 -*-
import scrapy

from mooc.items import MoocItem


class MoocscrapySpider(scrapy.Spider):
    name = 'moocscrapy'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list/']
    # Number of the page currently being crawled
    n = 1

    def parse(self, response):
        # Select the <a> element of every course card
        for box in response.xpath('//div[@class="course-card-container"]/a[@target="_blank"]'):
            # Create a fresh item per course so yielded items do not share state
            item = MoocItem()
            # Course URL
            item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
            # Course title
            item['title'] = box.xpath('.//div[@class="course-card-content"]/h3[@class="course-card-name"]//text()').extract()[0]
            # Cover image URL
            item['image_url'] = "http:" + box.xpath('.//img[@class="course-banner lazy"]/@data-original').extract()[0]
            # Number of learners
            item['student'] = box.xpath('.//span[2]/text()').extract()[0]
            # Course description
            item['introduction'] = box.xpath('.//p/text()').extract()[0].strip()
            # Hand the item over to Scrapy
            yield item
        # Follow the next page until page 29 (the last one) has been requested
        if self.n < 29:
            self.n = self.n + 1
            newurl = 'http://www.imooc.com/course/list/?page=' + str(self.n)
            yield scrapy.Request(newurl, callback=self.parse)
mooc/items.py
import scrapy


class MoocItem(scrapy.Item):
    # Course title
    title = scrapy.Field()
    # Course URL
    url = scrapy.Field()
    # Course cover image URL
    image_url = scrapy.Field()
    # Course description
    introduction = scrapy.Field()
    # Number of learners
    student = scrapy.Field()
    # Image path
    image_path = scrapy.Field()
mooc/middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
import random


class UserAgentDownloadMiddleWare(object):
    # Pool of User-Agent strings; one is picked at random for every request
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/44.0.2403.155 Safari/537.36',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; pl-PL; rv:1.0.1) Gecko/20021111 Chimera/0.6',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418.8 (KHTML, like Gecko, Safari) Cheshire/1.0.UNOFFICIAL',
        'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)'
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header of the outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent


class MoocSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MoocDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
mooc/pipelines.py
import json


class MoocPipeline(object):
    # Called when the spider is opened: open the output file
    def open_spider(self, spider):
        self.file = open('data.json', 'w', encoding='utf-8')

    # Called for every item the spider yields
    def process_item(self, item, spider):
        # Serialize the item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write it to the file
        self.file.write(line)
        # Return the item so that later pipelines can process it
        return item

    # Called when the spider is closed: close the file so buffered data is flushed
    def close_spider(self, spider):
        self.file.close()
mooc/settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for mooc project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'mooc'
SPIDER_MODULES = ['mooc.spiders']
NEWSPIDER_MODULE = 'mooc.spiders'
ITEM_PIPELINES = {
    'mooc.pipelines.MoocPipeline': 1,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mooc (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'mooc.middlewares.UserAgentDownloadMiddleWare': 543,
}
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'mooc.middlewares.MoocSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'mooc.middlewares.MoocDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'mooc.pipelines.MoocPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'