[Crawler] Scraping JD.com site-wide data with Python (products, shops, categories, comments)

大数据苦行僧—yisurvey123

3588 views · Published 2020-11-09 18:44:40

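Most articles online about scraping JD.com data either no longer work or retrieve only part of the data. This article describes how to scrape JD.com site-wide data, including product, shop, comment, and category information.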

1. Environment
OS: Windows 10
Python: 3.5
Scrapy: 1.3.2
pymongo: 3.2
PyCharm
Environment setup is not covered here; installation guides are easy to find online.

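2. Database description

Product categories
JD.com has roughly 1,183 categories, not counting virtual products (phone top-ups, lottery tickets, travel tickets, and so on). They are listed at:
https://www.jd.com/allSort.aspx
The crawl starts from this URL. Some entries on that page are channel pages rather than leaf categories, meaning they contain many subcategories of their own, so they need special handling before every category can be collected. The method is described in the crawling section.

name # category name
url # category URL
_id # category ID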

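Products

url # product URL
_id # product ID
category # product category
reallyPrice # current price
originalPrice # original price
description # product description
shopId # shop ID
venderId # vendor ID
commentCount # total number of comments
goodComment # number of positive comments
generalComment # number of neutral comments
poolComment # number of negative comments
favourableDesc1 # promotion description 1
favourableDesc2 # promotion description 2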

Comments

_id # comment ID
productId # product ID
guid
content # comment text
creationTime # comment time
isTop
referenceId
referenceName
referenceType
referenceTypeId
firstCategory
secondCategory
thirdCategory
replyCount # number of replies
score # rating
status
title
usefulVoteCount # number of "useful" votes
uselessVoteCount # number of "not useful" votes
userImage
userImageUrl
userLevelId
userProvince
viewCount
orderId # order ID
isReplyGrade
nickname # commenter's nickname
userClient
mergeOrderStatus
discussionId
productColor
productSize
imageCount # number of images in the comment
integral
userImgFlag
anonymousFlag
userLevelName
plusAvailable
recommend
userLevelColor
userClientShow
isMobile # whether the comment was posted from a mobile client
days
afterDays # number of follow-up comments

Shops

A shop that has an alias usually has two URLs. For example, 宝梦旗舰店 (the Baomeng flagship store):
url1: http://mall.jd.com/index-596056.html
url2: https://baomeng.jd.com/

_id # shop name
name # shop name
url1 # shop URL 1
url2 # shop URL 2
shopId # shop ID
venderId # vendor ID

Comment summary

_id
goodRateShow # positive rate
poorRateShow # negative rate
poorCountStr # negative count (string)
averageScore # average score
generalCountStr # neutral count (string)
showCount
showCountStr
goodCount # positive count
generalRate # neutral rate
generalCount # neutral count
skuId
goodCountStr # positive count (string)
poorRate # negative rate
afterCount # follow-up comment count
goodRateStyle
poorCount
skuIds
poorRateStyle
generalRateStyle
commentCountStr
commentCount
productId # product ID
afterCountStr
goodRate
generalRateShow
jwotestProduct
maxPage
score
soType
imageListCount
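
Every collection above is keyed by `_id`, so crawling the same product or comment twice should overwrite the stored document rather than duplicate it. The article does not show its storage code, so the following is only a minimal sketch of a Scrapy-style item pipeline that upserts by `_id`. The class name `MongoUpsertPipeline` and the injected `collection` argument are assumptions made for illustration; `collection` is expected to behave like a pymongo `Collection`, whose `update_one(filter, update, upsert=True)` method is the only call used.

```python
class MongoUpsertPipeline:
    """Hypothetical pipeline: store each item under its _id, overwriting on re-crawl."""

    def __init__(self, collection):
        # Assumption: `collection` exposes pymongo's Collection.update_one().
        self.collection = collection

    def process_item(self, item, spider=None):
        doc = dict(item)
        key = doc.pop('_id')
        # upsert=True inserts the document when the _id is new, otherwise updates it.
        self.collection.update_one({'_id': key}, {'$set': doc}, upsert=True)
        return item
```

In a real project Scrapy creates pipelines itself, so the collection would more likely be opened in `open_spider` than passed to the constructor; injecting it here just keeps the sketch easy to test.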

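3. Crawling

Scraping categories
The code is as follows:

def parse_category(self, response):
    """Parse a category page."""
    # key_word is a list of subdomain names to follow, defined elsewhere in the spider.
    selector = Selector(response)
    try:
        texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
        for text in texts:
            # Each match is (href, link text). The exact pattern is a reconstruction (assumption).
            items = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', text)
            for item in items:
                # item[0] looks like '//list.jd.com/...'; this takes the subdomain.
                if item[0].split('.')[0][2:] in key_word:
                    if item[0].split('.')[0][2:] != 'list':
                        # Channel page: crawl it again to find nested categories.
                        yield Request(url='https:' + item[0], callback=self.parse_category)
                    else:
                        categoriesItem = CategoriesItem()
                        categoriesItem['name'] = item[1]
                        categoriesItem['url'] = 'https:' + item[0]
                        categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                        yield categoriesItem
                        yield Request(url='https:' + item[0], callback=self.parse_list)
    except Exception as e:
        print('error:', e)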

As mentioned earlier, some categories contain many subcategories. For such URLs the page is crawled again with the same category callback:

if item[0].split('.')[0][2:] != 'list':
    yield Request(url='https:' + item[0], callback=self.parse_category)
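
The subdomain test and the category ID extraction rely on chained `split` calls, which are easy to misread. As an illustration only, the same decision can be sketched with `urllib.parse`. The helper name `classify_category_link` and the sample links are hypothetical, and JD.com's real link formats may differ.

```python
from urllib.parse import urlparse, parse_qs


def classify_category_link(href):
    """Return ('list', cat_id) for a leaf category link, else ('channel', None).

    `href` is a protocol-relative link such as '//list.jd.com/list.html?cat=1,2,3'.
    """
    parts = urlparse('https:' + href)
    subdomain = parts.netloc.split('.')[0]
    if subdomain == 'list':
        # The category ID is the value of the `cat` query parameter.
        cat = parse_qs(parts.query).get('cat', [None])[0]
        return 'list', cat
    # Anything else is treated as a channel page that needs another category pass.
    return 'channel', None
```

A `'list'` result corresponds to the branch that stores a category and requests its product list; a `'channel'` result corresponds to the branch that re-enters the category parser.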

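Scraping products
Visiting each category URL returns the product list. From it, find each product's URL and follow it to the detail page to scrape the product details:

def parse_list(self, response):
    """Get the product URLs (next-page handling is not shown here)."""
    meta = dict()
    # The category ID is the value of the first query parameter in the list URL.
    meta['category'] = response.url.split('=')[1].split('&')[0]
    selector = Selector(response)
    texts = selector.xpath('//*[@id="plist"]/ul/li/div/div[@class="p-img"]/a').extract()
    for text in texts:
        items = re.findall(r'<a target="_blank" href="(.*?)">', text)
        yield Request(url='https:' + items[0], callback=self.parse_product, meta=meta)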
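
The regular expression in the list parser only matches when `target` comes before `href` and nothing follows `href` in the tag. Inside Scrapy, selecting `@href` directly in the XPath avoids that. As a standalone illustration, the sketch below collects links with the standard library's `html.parser` regardless of attribute order. The names `HrefCollector` and `product_urls` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, regardless of attribute order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def product_urls(html_fragment):
    """Return absolute URLs for all links found in `html_fragment`."""
    collector = HrefCollector()
    collector.feed(html_fragment)
    # Protocol-relative links ('//item.jd.com/...') get an explicit scheme.
    return ['https:' + h if h.startswith('//') else h for h in collector.hrefs]
```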

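Most of a product's basic information is available on the detail page, but some of it, such as the price and promotions, is loaded dynamically and has to be requested separately.

Start with the price. The URL format is:

https://p.3.cn/prices/mgets?skuIds=J_(product_id)

The part in parentheses at the end is the product ID, which is filled in at run time. The code is as follows:

response = requests.get(url=price_url + product_id)
price_json = response.json()
productsItem['reallyPrice'] = price_json[0]['p']
productsItem['originalPrice'] = price_json[0]['m']

The response is JSON, so it is easy to parse.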
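
Indexing `price_json[0]` directly fails when the endpoint returns an empty list. A slightly more defensive version of the parsing step might look like the sketch below. The function name `parse_price` and the sample payload are made up for illustration; only the `p` and `m` keys come from the article.

```python
import json


def parse_price(payload):
    """Return (current_price, original_price) from the price endpoint's JSON text."""
    entries = json.loads(payload)
    if not entries:
        # No price entry came back for this product.
        return None, None
    first = entries[0]
    return first.get('p'), first.get('m')


# Made-up payload in the shape the article describes: a list with one object.
sample = '[{"p": "4999.00", "m": "5999.00"}]'
current, original = parse_price(sample)
```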

Next, the promotions. There are two kinds: coupons and spend-threshold discount descriptions.

Both kinds need to be scraped, and both are loaded dynamically. The code is as follows:

# Promotions
res_url = favourable_url % (product_id, shop_id, vender_id, category.replace(',', '%2c'))
print(res_url)
response = requests.get(res_url)
fav_data = response.json()
if fav_data['skuCoupon']:
    desc1 = []
    for item in fav_data['skuCoupon']:
        start_time = item['beginTime']
        end_time = item['endTime']
        time_dec = item['timeDesc']
        fav_price = item['quota']
        fav_count = item['discount']
        fav_time = item['addDays']
        # The text reads "valid from %s to %s, spend %s and save %s".
        desc1.append(u'有效期%s至%s,满%s减%s' % (start_time, end_time, fav_price, fav_count))
    productsItem['favourableDesc1'] = ';'.join(desc1)
if fav_data['prom'] and fav_data['prom']['pickOneTag']:
    desc2 = []
    for item in fav_data['prom']['pickOneTag']:
        desc2.append(item['content'])
    # Promotion tags go into the second description field.
    productsItem['favourableDesc2'] = ';'.join(desc2)
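
The promotion code indexes keys such as `skuCoupon` and `prom` directly, which raises `KeyError` if the response omits them. As a hedged sketch, the same two descriptions can be built as a pure function that tolerates missing keys. The function name `describe_promotions`, the English wording of the coupon text, and the sample data are assumptions; the key names are the ones used in the article.

```python
def describe_promotions(fav_data):
    """Return (coupon_description, promotion_description) from the promotion JSON."""
    coupons = fav_data.get('skuCoupon') or []
    desc1 = [
        'valid %s to %s, spend %s save %s' % (
            c.get('beginTime'), c.get('endTime'), c.get('quota'), c.get('discount'))
        for c in coupons
    ]
    prom = fav_data.get('prom') or {}
    tags = prom.get('pickOneTag') or []
    desc2 = [t.get('content') or '' for t in tags]
    return ';'.join(desc1), ';'.join(desc2)


# Made-up response in the shape the article's code expects.
sample = {
    'skuCoupon': [{'beginTime': '2020-11-01', 'endTime': '2020-11-11',
                   'quota': 200, 'discount': 20}],
    'prom': {'pickOneTag': [{'content': 'A'}, {'content': 'B'}]},
}
coupon_desc, promo_desc = describe_promotions(sample)
```

Keeping the formatting separate from the HTTP request also makes it possible to test this step without touching the network.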

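Scraping shop information
The shop ID and vendor ID can be found directly on each product's detail page:

# First try the form where shopId is quoted, then the unquoted form.
ids = re.findall(r"venderId:(.*?),\s.*?shopId:'(.*?)'", response.text)
if not ids:
    ids = re.findall(r"venderId:(.*?),\s.*?shopId:(.*?),", response.text)
vender_id = ids[0][0]
shop_id = ids[0][1]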
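
If neither pattern matches, `ids[0]` raises `IndexError`. The sketch below wraps the same two patterns in a helper that returns `None` instead. The function name `extract_shop_ids` and the sample page text are hypothetical; the sample only imitates the shape those patterns expect.

```python
import re

# The two layouts handled above: shopId quoted, then shopId unquoted.
ID_PATTERNS = (
    r"venderId:(.*?),\s.*?shopId:'(.*?)'",
    r"venderId:(.*?),\s.*?shopId:(.*?),",
)


def extract_shop_ids(page_text):
    """Return (vender_id, shop_id), or None when neither pattern matches."""
    for pattern in ID_PATTERNS:
        match = re.search(pattern, page_text)
        if match:
            return match.group(1).strip(), match.group(2).strip()
    return None


# Made-up snippet imitating the inline script on a product page.
sample = "venderId:1000001,\n shopId:'2000002',\n"
ids = extract_shop_ids(sample)
```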

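The shop name is harder to get. There are several page layouts, the shop title sits in a different place in each, and JD self-operated products need handling on the detail page as well. The code tries each location in turn and falls back to '京东自营' (JD self-operated) when none matches:

try:
    name = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/a//text()').extract()[0]
except IndexError:
    try:
        name = response.xpath('//div[@class="name"]/a//text()').extract()[0].strip()
    except IndexError:
        try:
            name = response.xpath('//div[@class="shopName"]/strong/span/a//text()').extract()[0].strip()
        except IndexError:
            try:
                name = response.xpath('//div[@class="seller-infor"]/a//text()').extract()[0].strip()
            except IndexError:
                name = u'京东自营'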
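
The nested `try` blocks express one idea: try several locations in order and take the first that yields text. A flatter way to write that, shown here only as a sketch, is a loop over candidate queries. The helper name `first_text` is hypothetical, and `lookup` stands for any callable that maps a query to a list of strings; in the spider it could be `lambda q: response.xpath(q).extract()`. Unlike the original, this version also skips results that are empty after stripping.

```python
def first_text(lookup, queries, default):
    """Return the first non-empty, stripped result among `queries`, else `default`."""
    for query in queries:
        for value in lookup(query):
            value = value.strip()
            if value:
                return value
    return default


# Stand-in for a page: maps a query string to the texts it would select.
fake_page = {'query-b': ['  Shop B '], 'query-c': ['Shop C']}
name = first_text(lambda q: fake_page.get(q, []),
                  ['query-a', 'query-b', 'query-c'],
                  default='fallback')
```

Adding another page layout then means appending one query to the list rather than nesting another `try`.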

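Scraping comments
Comments are also loaded dynamically and returned as JSON. The URL format is:

https://club.jd.com/comment/productPageComments.action?productId=(product_id)&score=0&sortType=5&page=%s&pageSize=10

Apart from the page number, only the product ID is needed.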
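
To make the URL format concrete, the sketch below fills the product ID and page number into a template with two `%s` placeholders, which is how the crawl code uses it. The constant name `COMMENT_URL` and the helper `comment_page_urls` are illustrative, and starting the page index at 0 is an assumption inferred from the pagination loop, which requests pages from 1 onward after the first response.

```python
COMMENT_URL = ('https://club.jd.com/comment/productPageComments.action'
               '?productId=%s&score=0&sortType=5&page=%s&pageSize=10')


def comment_page_urls(product_id, max_page):
    """Return the comment URLs for pages 0 .. max_page - 1 of one product."""
    return [COMMENT_URL % (product_id, page) for page in range(max_page)]


# Hypothetical product ID, two pages of comments.
urls = comment_page_urls('123', 2)
```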

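The code for fetching comment information is as follows:

# The method name is assumed; the original definition line is missing.
def parse_comments(self, response):
   """Fetch a product's comments."""
   try:
       data = json.loads(response.text)
   except Exception as e:
       print('get comment failed:', e)
       return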

   product_id = response.meta['product_id']

   commentSummaryItem = CommentSummaryItem()
   commentSummary = data.get('productCommentSummary')
   commentSummaryItem['goodRateShow'] = commentSummary.get('goodRateShow')
   commentSummaryItem['poorRateShow'] = commentSummary.get('poorRateShow')
   commentSummaryItem['poorCountStr'] = commentSummary.get('poorCountStr')
   commentSummaryItem['averageScore'] = commentSummary.get('averageScore')
   commentSummaryItem['generalCountStr'] = commentSummary.get('generalCountStr')
   commentSummaryItem['showCount'] = commentSummary.get('showCount')
   commentSummaryItem['showCountStr'] = commentSummary.get('showCountStr')
   commentSummaryItem['goodCount'] = commentSummary.get('goodCount')
   commentSummaryItem['generalRate'] = commentSummary.get('generalRate')
   commentSummaryItem['generalCount'] = commentSummary.get('generalCount')
   commentSummaryItem['skuId'] = commentSummary.get('skuId')
   commentSummaryItem['goodCountStr'] = commentSummary.get('goodCountStr')
   commentSummaryItem['poorRate'] = commentSummary.get('poorRate')
   commentSummaryItem['afterCount'] = commentSummary.get('afterCount')
   commentSummaryItem['goodRateStyle'] = commentSummary.get('goodRateStyle')
   commentSummaryItem['poorCount'] = commentSummary.get('poorCount')
   commentSummaryItem['skuIds'] = commentSummary.get('skuIds')
   commentSummaryItem['poorRateStyle'] = commentSummary.get('poorRateStyle')
   commentSummaryItem['generalRateStyle'] = commentSummary.get('generalRateStyle')
   commentSummaryItem['commentCountStr'] = commentSummary.get('commentCountStr')
   commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
   commentSummaryItem['productId'] = commentSummary.get('productId')  # same as the ProductsItem id
   commentSummaryItem['_id'] = commentSummary.get('productId')
   commentSummaryItem['afterCountStr'] = commentSummary.get('afterCountStr')
   commentSummaryItem['goodRate'] = commentSummary.get('goodRate')
   commentSummaryItem['generalRateShow'] = commentSummary.get('generalRateShow')
   commentSummaryItem['jwotestProduct'] = data.get('jwotestProduct')
   commentSummaryItem['maxPage'] = data.get('maxPage')
   commentSummaryItem['score'] = data.get('score')
   commentSummaryItem['soType'] = data.get('soType')
   commentSummaryItem['imageListCount'] = data.get('imageListCount')
   yield commentSummaryItem

   for hotComment in data['hotCommentTagStatistics']:
       hotCommentTagItem = HotCommentTagItem()
       hotCommentTagItem['_id'] = hotComment.get('id')
       hotCommentTagItem['name'] = hotComment.get('name')
       hotCommentTagItem['status'] = hotComment.get('status')
       hotCommentTagItem['rid'] = hotComment.get('rid')
       hotCommentTagItem['productId'] = hotComment.get('productId')
       hotCommentTagItem['count'] = hotComment.get('count')
       hotCommentTagItem['created'] = hotComment.get('created')
       hotCommentTagItem['modified'] = hotComment.get('modified')
       hotCommentTagItem['type'] = hotComment.get('type')
       hotCommentTagItem['canBeFiltered'] = hotComment.get('canBeFiltered')
       yield hotCommentTagItem

   for comment_item in data['comments']:
       comment = CommentItem()

       comment['_id'] = comment_item.get('id')
       comment['productId'] = product_id
       comment['guid'] = comment_item.get('guid')
       comment['content'] = comment_item.get('content')
       comment['creationTime'] = comment_item.get('creationTime')
       comment['isTop'] = comment_item.get('isTop')
       comment['referenceId'] = comment_item.get('referenceId')
       comment['referenceName'] = comment_item.get('referenceName')
       comment['referenceType'] = comment_item.get('referenceType')
       comment['referenceTypeId'] = comment_item.get('referenceTypeId')
       comment['firstCategory'] = comment_item.get('firstCategory')
       comment['secondCategory'] = comment_item.get('secondCategory')
       comment['thirdCategory'] = comment_item.get('thirdCategory')
       comment['replyCount'] = comment_item.get('replyCount')
       comment['score'] = comment_item.get('score')
       comment['status'] = comment_item.get('status')
       comment['title'] = comment_item.get('title')
       comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
       comment['uselessVoteCount'] = comment_item.get('uselessVoteCount')
       # Guard against a missing value before prepending the scheme.
       comment['userImage'] = 'http://' + (comment_item.get('userImage') or '')
       comment['userImageUrl'] = 'http://' + (comment_item.get('userImageUrl') or '')
       comment['userLevelId'] = comment_item.get('userLevelId')
       comment['userProvince'] = comment_item.get('userProvince')
       comment['viewCount'] = comment_item.get('viewCount')
       comment['orderId'] = comment_item.get('orderId')
       comment['isReplyGrade'] = comment_item.get('isReplyGrade')
       comment['nickname'] = comment_item.get('nickname')
       comment['userClient'] = comment_item.get('userClient')
       comment['mergeOrderStatus'] = comment_item.get('mergeOrderStatus')
       comment['discussionId'] = comment_item.get('discussionId')
       comment['productColor'] = comment_item.get('productColor')
       comment['productSize'] = comment_item.get('productSize')
       comment['imageCount'] = comment_item.get('imageCount')
       comment['integral'] = comment_item.get('integral')
       comment['userImgFlag'] = comment_item.get('userImgFlag')
       comment['anonymousFlag'] = comment_item.get('anonymousFlag')
       comment['userLevelName'] = comment_item.get('userLevelName')
       comment['plusAvailable'] = comment_item.get('plusAvailable')
       comment['recommend'] = comment_item.get('recommend')
       comment['userLevelColor'] = comment_item.get('userLevelColor')
       comment['userClientShow'] = comment_item.get('userClientShow')
       comment['isMobile'] = comment_item.get('isMobile')
       comment['days'] = comment_item.get('days')
       comment['afterDays'] = comment_item.get('afterDays')
       yield comment

       if 'images' in comment_item:
           for image in comment_item['images']:
               commentImageItem = CommentImageItem()
               commentImageItem['_id'] = image.get('id')
               commentImageItem['associateId'] = image.get('associateId')  # same as CommentItem's discussionId
               commentImageItem['productId'] = image.get('productId')  # not the ProductsItem id; this value is 0
               commentImageItem['imgUrl'] = 'http:' + (image.get('imgUrl') or '')
               commentImageItem['available'] = image.get('available')
               commentImageItem['pin'] = image.get('pin')
               commentImageItem['dealt'] = image.get('dealt')
               commentImageItem['imgTitle'] = image.get('imgTitle')
               commentImageItem['isMain'] = image.get('isMain')
               yield commentImageItem

   # next page
   for i in range(1, int(data['maxPage'])):
       url = comment_url % (product_id, str(i))
       meta = dict()
       meta['product_id'] = product_id
       yield Request(url=url, callback=self.parse_comments2, meta=meta)
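
Most of the comment parser copies a key from the JSON into an item field of the same name. As a sketch of a more compact alternative, a field list and a small helper can do the copying. The names `SUMMARY_FIELDS` and `copy_fields` are hypothetical, and the field list here is deliberately shortened.

```python
# Shortened, illustrative field list; the full one would mirror the item definition.
SUMMARY_FIELDS = ('goodRateShow', 'poorRateShow', 'goodCount', 'poorCount', 'commentCount')


def copy_fields(source, fields, target=None):
    """Copy `fields` from the `source` dict into `target`; missing keys become None."""
    target = {} if target is None else target
    source = source or {}  # tolerate a missing summary block
    for name in fields:
        target[name] = source.get(name)
    return target


# Made-up summary data with only some of the fields present.
summary = copy_fields({'goodCount': 5, 'commentCount': 7}, SUMMARY_FIELDS)
```

Because a Scrapy item supports `item[name] = value` for its declared fields, passing an item such as `CommentSummaryItem()` as `target` should work the same way as the plain dict used here.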

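Crawl process
The core code has been posted above. It is fairly rough, and discussion is welcome.
For more analysis, and for scraping Taobao data, see:
http://cloud.yisurvey.com:9081//html/37be8794-b79e-4511-9d0a-81f082bac606.html
This article is reposted from the internet for study and exchange only; copyright belongs to the original author. For any issue concerning the work or its copyright, please contact us to have it removed.
Note: this article is intended for technical exchange. Do not use the techniques described here for illegal purposes; otherwise you bear all consequences. If you believe your legitimate rights have been infringed, please contact us so that we can address it.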
