Scraping Chongqing Second-Hand Housing Listings from Anjuke with Scrapy and Saving Them to MySQL (Part 1)
What is Scrapy
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages.
Official site: https://scrapy.org/
Documentation: https://docs.scrapy.org/en/latest/
GitHub: https://github.com/scrapy/scrapy
Preparation
Environment used for this project: Python 3.7.0, MySQL 8.0.18
Installing the Python environment on Windows: see the companion post on setting up Python on Windows
Installing the MySQL database on Windows: see the companion post on setting up MySQL on Windows
Install the pymysql and scrapy packages with pip:
pip install pymysql
pip install scrapy
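As a quick sanity check (optional), you can verify that both packages import correctly; the version printed will simply be whatever your local install reports:
python -c "import scrapy, pymysql; print(scrapy.__version__)"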
First, create a house_area table in MySQL to store the city districts; later we will fetch the listing data for each district. The SQL script that creates the house_area table is as follows:
CREATE TABLE `house_area` (
`id` int UNSIGNED AUTO_INCREMENT,
`code` varchar(255) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`parent_id` int not null DEFAULT 0,
`parent_code` varchar(255) DEFAULT NULL,
`display_order` INT NOT NULL DEFAULT 0,
`created_on` timestamp DEFAULT current_timestamp,
`updated_on` timestamp DEFAULT current_timestamp on update current_timestamp,
PRIMARY KEY (id)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
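Note that this table, and the pipeline shown later, live in a database named house (the name comes from the pipeline's connection below). If that database does not exist yet, create it first; a minimal sketch, with the charset chosen only to match the table above:
CREATE DATABASE IF NOT EXISTS `house` DEFAULT CHARACTER SET utf8;
USE `house`;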
Project walkthrough
Following the official documentation, use the scrapy command to quickly create the project that fetches the districts:
scrapy startproject hourse_area
After the project is created, the directory structure is the standard layout that scrapy startproject generates:
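hourse_area/
    scrapy.cfg
    hourse_area/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py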
Define the item for the districts. As shown above, we define it directly in items.py:
import scrapy

class HourseAreaItem(scrapy.Item):
    # district code parsed from the listing URL
    code = scrapy.Field()
    # district display name
    name = scrapy.Field()
    # code of the parent district ('' for top-level districts)
    parent_code = scrapy.Field()
    # ordering index within the same level
    display_order = scrapy.Field()
Next, write the spider that fetches the districts. Create areaspider.py in the spiders directory; it collects Chongqing's first- and second-level districts:
import scrapy
from hourse_area.items import HourseAreaItem


class AreaSpider(scrapy.Spider):
    name = 'area'
    allowed_domains = ["anjuke.com"]
    start_urls = [
        'https://chongqing.anjuke.com/sale/',
    ]

    def parse(self, response):
        # first-level districts in the filter bar of the listing page
        area_lists = response.css('div.div-border.items-list div.items:first-child .elems-l a')
        display_order = 1
        for item in area_lists:
            # create a fresh item for each link so previously yielded items are not mutated
            area_item = HourseAreaItem()
            href = item.css('::attr(href)').extract_first().strip()
            # the district code is the last path segment of the URL
            area_item['code'] = href.replace('https://chongqing.anjuke.com/sale/', '').replace('/', '')
            area_item['name'] = item.css('::text').extract_first().strip()
            area_item['parent_code'] = ''
            area_item['display_order'] = display_order
            display_order += 1
            yield area_item
            # follow the district page to collect its sub-districts
            yield scrapy.Request(href, callback=self.parse_subarea, meta={'parent_code': area_item['code']})

    def parse_subarea(self, response):
        # second-level districts shown under the selected first-level district
        subarea_lists = response.css('div.div-border.items-list div.items:first-child .elems-l .sub-items a')
        display_order = 1
        for item in subarea_lists:
            area_item = HourseAreaItem()
            href = item.css('::attr(href)').extract_first().strip()
            area_item['code'] = href.replace('https://chongqing.anjuke.com/sale/', '').replace('/', '')
            area_item['name'] = item.css('::text').extract_first().strip()
            area_item['parent_code'] = response.meta['parent_code']
            area_item['display_order'] = display_order
            display_order += 1
            yield area_item
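If the CSS selectors above ever stop matching (Anjuke adjusts its page markup from time to time), scrapy shell is a convenient way to test them interactively, assuming the page loads without triggering anti-crawler verification:
scrapy shell "https://chongqing.anjuke.com/sale/"
>>> response.css('div.div-border.items-list div.items:first-child .elems-l a::text').extract()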
Then develop the item pipeline that saves the crawled first- and second-level district data to MySQL. Open pipelines.py and implement the storage logic:
import pymysql


class HourseAreaPipeline(object):
    def __init__(self):
        # connection parameters should match your local MySQL setup
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="house", charset="utf8")
        self.cursor = self.db.cursor()

    def __del__(self):
        self.db.close()

    def process_item(self, item, spider):
        # check whether this district has already been saved
        select_sql = "select id from house_area where code=%s"
        already_save = self.cursor.execute(select_sql, (item['code'],))
        if already_save == 1:
            # update the existing row
            update_sql = "update house_area set name=%s where code=%s"
            self.cursor.execute(update_sql, (item['name'], item['code']))
            self.db.commit()
        else:
            parent_id = 0
            # look up the parent district to fill in parent_id
            if item['parent_code']:
                select_sql = "select id from house_area where code=%s"
                already_save = self.cursor.execute(select_sql, (item['parent_code'],))
                house_area = self.cursor.fetchone()
                if already_save == 1:
                    parent_id = house_area[0]
            # insert a new row
            insert_sql = "insert into house_area(code,name,parent_id,parent_code,display_order)" \
                         " values(%s,%s,%s,%s,%s)"
            self.cursor.execute(insert_sql,
                                (item['code'], item['name'], parent_id,
                                 item['parent_code'], item['display_order']))
            self.db.commit()
        return item
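Relying on __del__ to close the connection works here, but Scrapy pipelines also provide explicit open_spider/close_spider hooks. A minimal sketch of that variant, using the same connection parameters as above, would replace __init__ and __del__ with:
    def open_spider(self, spider):
        # open the connection when the spider starts
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="house", charset="utf8")
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.db.close()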
Finally, configure the project settings. Open settings.py and set the following two items; everything else can keep its default value:
# ... other default settings ...
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
# ... other default settings ...
ITEM_PIPELINES = {
    'hourse_area.pipelines.HourseAreaPipeline': 300,
}
# ... other default settings ...
Run the following scrapy command from the project root to start crawling:
scrapy crawl area
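After the crawl finishes, a quick query that joins each sub-district to its parent confirms the data landed as expected (table and column names as defined above):
select c.name as district, p.name as parent_district
from house_area c
left join house_area p on p.id = c.parent_id
order by c.parent_id, c.display_order;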
With that, Chongqing's first- and second-level district data has been fetched, processed, and saved into the MySQL database. The next post will cover how to fetch the listing data for each district.
Disclaimer: this project is for learning purposes only. Any commercial use of it has nothing to do with the author; you bear the responsibility yourself.