Scraping second-hand housing listings in Chongqing from Anjuke with Scrapy and storing them in MySQL (Part 2)
In the previous part we collected Chongqing's first- and second-level districts (Scraping second-hand housing listings in Chongqing from Anjuke with Scrapy and storing them in MySQL, Part 1). In this part we fetch the second-hand house listings for each second-level district.
Initializing the database
Create the tables for the listing data: the house table stores the listing details, and house_price stores prices (fetched periodically so that price trends can be analyzed):
CREATE TABLE `house` (
`id` int UNSIGNED AUTO_INCREMENT,
`area_id` int NOT NULL DEFAULT 0,
`area_code` varchar(255) DEFAULT NULL,
`title` varchar(2000) DEFAULT NULL,
`unit_price` decimal(19,4) NOT NULL DEFAULT 0,
`total_price` decimal(19,4) NOT NULL DEFAULT 0,
`code` varchar(255) DEFAULT NULL,
`community` varchar(255) DEFAULT NULL,
`location` varchar(255) DEFAULT NULL,
`build_years` varchar(255) DEFAULT NULL,
`floor` varchar(255) DEFAULT NULL,
`layout` varchar(255) DEFAULT NULL,
`size` varchar(255) DEFAULT NULL,
`picture_url` varchar(255) DEFAULT NULL,
`url` varchar(1024) DEFAULT NULL,
`created_on` timestamp DEFAULT current_timestamp,
`updated_on` timestamp DEFAULT current_timestamp on update current_timestamp,
PRIMARY KEY (id)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
CREATE TABLE `house_price` (
`id` int UNSIGNED AUTO_INCREMENT,
`house_id` int NOT NULL DEFAULT 0,
`house_code` varchar(255) DEFAULT NULL,
`unit_price` decimal(19,4) NOT NULL DEFAULT 0,
`total_price` decimal(19,4) NOT NULL DEFAULT 0,
`created_on` timestamp DEFAULT current_timestamp,
`updated_on` timestamp DEFAULT current_timestamp on update current_timestamp,
PRIMARY KEY (id)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
Creating the project
Create the Scrapy project for the second-hand house crawler:
scrapy startproject hourse
After the project has been created, the directory structure looks like the following.
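This is the standard skeleton generated by scrapy startproject (a rough sketch; middlewares.py may not appear on older Scrapy versions, and housespider.py is added under spiders/ in a later step):
hourse/
    scrapy.cfg
    hourse/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py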
Defining the item
Open items.py and define the item that holds a single listing:
import scrapy


class HourseItem(scrapy.Item):
    area_code = scrapy.Field()
    title = scrapy.Field()
    unit_price = scrapy.Field()
    total_price = scrapy.Field()
    code = scrapy.Field()
    community = scrapy.Field()
    location = scrapy.Field()
    build_years = scrapy.Field()
    floor = scrapy.Field()
    layout = scrapy.Field()
    size = scrapy.Field()
    picture_url = scrapy.Field()
    url = scrapy.Field()
Developing the spider
Create the spider file housespider.py in the spiders directory; based on the page structure, it parses the listing data with CSS selectors:
import scrapy
import re
import pymysql
from hourse.items import HourseItem


class HouseSpider(scrapy.Spider):
    name = 'house'
    allowed_domains = ["anjuke.com"]
    # start_urls = [
    #     'https://chongqing.anjuke.com/sale/p1/',
    # ]

    def __init__(self):
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="house", charset="utf8")
        self.cursor = self.db.cursor()
        # Read the districts from the database and crawl the listings per district.
        results = self.get_areas()
        urls = []
        for area_item in results:
            urls.append('https://chongqing.anjuke.com/sale/' + area_item[1] + '/')
            # break
        self.start_urls = urls
        self.enable_next_page = True

    def close(self):
        # Close the database connection when the spider finishes.
        self.db.close()

    def get_areas(self):
        # Crawling too many districts triggers Anjuke's human-verification check and
        # requests get redirected to a captcha page, so add a condition here to fetch
        # only a subset of districts if needed.
        select_sql = "select * from house_area where parent_id > 0"
        # select_sql = "select * from house_area where id = 1"
        self.cursor.execute(select_sql)
        results = self.cursor.fetchall()
        return results

    def parse(self, response):
        house_list = response.css('li.list-item')
        area_code = ''
        current_url = re.search('sale/([^/]+)/', response.url)
        if current_url:
            area_code = current_url.group(1)
        pat_code = r'/([a-zA-Z0-9]+)\?'
        for item in house_list:
            # Build a fresh item for every listing.
            house_item = HourseItem()
            house_item['area_code'] = area_code
            house_item['title'] = item.css('.house-title a::text').extract_first().strip()
            house_item['url'] = item.css('.house-title a::attr(href)').extract_first().strip()
            # Total price is in units of 10,000 CNY.
            house_item['total_price'] = re.search(r'\d+', item.css('.pro-price .price-det').extract_first().strip()).group()
            house_item['code'] = ''
            if house_item['url']:
                search_code = re.search(pat_code, house_item['url'])
                if search_code:
                    house_item['code'] = search_code.group(1)
            house_item['community'] = ''
            house_item['location'] = item.css('.details-item:nth-child(3) span.comm-address::attr(title)').extract_first().replace(u'\xa0', u' ').strip()
            if house_item['location']:
                address_items = re.split(r'\s+', house_item['location'])
                if len(address_items) > 1:
                    house_item['community'] = address_items[0]
            house_item['build_years'] = item.css('.details-item:nth-child(2) span:nth-child(7)::text').extract_first().strip()
            house_item['floor'] = item.css('.details-item:nth-child(2) span:nth-child(5)::text').extract_first().strip()
            house_item['layout'] = item.css('.details-item:nth-child(2) span:nth-child(1)::text').extract_first().strip()
            house_item['size'] = item.css('.details-item:nth-child(2) span:nth-child(3)::text').extract_first().strip()
            house_item['picture_url'] = item.css('.item-img img::attr(src)').extract_first().strip()
            house_item['unit_price'] = re.search(r'\d+', item.css('.pro-price .unit-price::text').extract_first().strip()).group()
            yield house_item
        # Pagination
        next_page = response.css('.multi-page a.aNxt::attr(href)').extract_first()
        if self.enable_next_page and next_page:
            # Build a new Request for the next results page.
            next_url = next_page.strip()
            yield scrapy.Request(next_url, callback=self.parse)
Developing the pipeline
Open pipelines.py and store the data scraped by the spider into the database:
import pymysql


class HoursePipeline(object):
    def __init__(self):
        self.db = pymysql.connect(host="localhost", user="root", password="123456",
                                  database="house", charset="utf8")
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Look up the internal id of the district this listing belongs to.
        select_area_sql = "select id from house_area where code='%s'" % item['area_code']
        is_area_exist = self.cursor.execute(select_area_sql)
        house_area = self.cursor.fetchone()
        area_id = 0
        if house_area:
            area_id = house_area[0]
        # Check whether the listing has already been stored.
        select_sql = "select id from house where code='%s'" % item['code']
        already_save = self.cursor.execute(select_sql)
        house_item = self.cursor.fetchone()
        self.db.commit()
        if already_save == 1:
            # Update the existing record.
            house_id = house_item[0]
            update_sql = "update house set title='%s', unit_price='%d', total_price='%d', url='%s' where id='%d'" \
                % (item['title'], int(item['unit_price']), int(item['total_price']), item['url'], int(house_id))
            self.cursor.execute(update_sql)
            self.db.commit()
            # Insert a price record.
            self.add_price(house_id, item['code'], item['unit_price'], item['total_price'])
        else:
            # Insert a new record.
            sql = "insert into house(area_id,area_code,title,unit_price,total_price,code,community,location,build_years,floor,layout,size,picture_url,url)" \
                  " values('%d','%s','%s','%d','%d','%s','%s','%s','%s','%s','%s','%s','%s','%s')" \
                  % (area_id, item['area_code'], item['title'], int(item['unit_price']), int(item['total_price']), item['code'],
                     item['community'], item['location'], item['build_years'], item['floor'], item['layout'], item['size'],
                     item['picture_url'], item['url'])
            self.cursor.execute(sql)
            house_id = int(self.db.insert_id())
            self.db.commit()
            # Insert a price record.
            self.add_price(house_id, item['code'], item['unit_price'], item['total_price'])
        return item

    def add_price(self, house_id, house_code, unit_price, total_price):
        sql = "insert into house_price(house_id, house_code, unit_price, total_price)" \
              " values('%d','%s','%d','%d')" % (int(house_id), house_code, int(unit_price), int(total_price))
        self.cursor.execute(sql)
        self.db.commit()

    def __del__(self):
        self.db.close()
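One caveat: the SQL above is assembled with Python string formatting, so a title or address containing a quote character will break the statement, and the pipeline is open to SQL injection. A minimal sketch of the same insert using pymysql's parameter binding (an optional hardening step, not part of the original project) would look roughly like this inside process_item:
# Sketch: the same insert with pymysql parameter binding instead of string
# formatting; cursor.execute() escapes the values passed in the tuple.
insert_sql = (
    "insert into house(area_id, area_code, title, unit_price, total_price, code,"
    " community, location, build_years, floor, layout, size, picture_url, url)"
    " values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)
self.cursor.execute(insert_sql, (
    area_id, item['area_code'], item['title'], int(item['unit_price']),
    int(item['total_price']), item['code'], item['community'], item['location'],
    item['build_years'], item['floor'], item['layout'], item['size'],
    item['picture_url'], item['url'],
))
self.db.commit()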
Configuring settings
Open settings.py and set the project's USER_AGENT and ITEM_PIPELINES; leave the other settings unchanged:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
ITEM_PIPELINES = {
    'hourse.pipelines.HoursePipeline': 300,
}
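As noted in the spider's get_areas comment, crawling too many districts can trigger Anjuke's human-verification page. If that happens, throttling the crawl in settings.py may help; the values below are only a suggestion and were not part of the original configuration:
# Optional throttling (not in the original settings): slow down requests to
# reduce the chance of being redirected to the human-verification page.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True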
Running the crawler
From the project root, run the following command to start the crawler (I ran it directly from the project directory in VS Code):
scrapy crawl house
That wraps up scraping Chongqing's second-hand housing data with Scrapy. I am new to Python and used this project to learn the Scrapy framework and some basic Python syntax; if anything here is off, feedback is welcome.
The project is on GitHub: https://github.com/liuhuifly/scrapy_cqanjuke, and the database initialization scripts are in the sqlscripts directory of the hourse project.
Disclaimer: this project is for learning purposes only; any commercial use of it has nothing to do with the author, and you bear the responsibility yourself.