Python crawler: scraping JD.com product listings with Selenium and saving the data to MongoDB
1. Tools
python3
pymongo
mongodb
selenium
Chrome browser
2. Analysis
2.1 URL analysis
Open the JD.com homepage and type any product name into the search box. Here the example is Huawei's newly released phone, the Huawei P50; click search to open the results page.
A login page may pop up first; if it does, just log in.
Once on the results page, note down its URL, then keep scrolling down until the pagination controls appear.
Click through to pages 2, 3 and 4, recording the URL of each page, and the following pattern emerges:
Page 1: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=1&s=58&click=0
Page 2: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=3&s=58&click=0
Page 3: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=5&s=121&click=0
Page 4: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=7&s=181&click=0
It is easy to see that the page URLs are almost identical; only the trailing page and s parameters differ. It also turns out that each page is still accessible if the last two parameters are removed, and the URLs without the s parameter are much easier to construct:
Page 1: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=1
Page 2: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=3
Page 3: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=5
Page 4: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=7
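In other words, the nth results page uses page = 2*n - 1 in the URL. A tiny sketch of that mapping:

# Observed mapping: results page n -> URL parameter page = 2*n - 1
def page_param(n):
    return 2 * n - 1

print([page_param(n) for n in range(1, 5)])  # [1, 3, 5, 7]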
While paging through, you can also notice that each page holds 60 products, but they are loaded in two batches of 30. That means that with the requests library we could only grab half of each page's products, which is clearly not good enough, so we use the selenium library instead.
3. Using Selenium
3.1 Getting cookies with Selenium
When Selenium fetches page after page, the server sometimes redirects the request to the login page. To handle this we first use Selenium to capture the cookies of a logged-in session; afterwards, when fetching a page, we extract the data directly if no redirect happens, and if a redirect does happen we attach the previously saved cookies and request the page again to get its source.
The code for capturing the cookies is as follows:
from selenium import webdriver
import json, time


def get_cookie(url):
    browser = webdriver.Chrome()
    browser.get(url)
    time.sleep(60)
    dictCookies = browser.get_cookies()    # get the cookies as a list of dicts
    jsonCookies = json.dumps(dictCookies)  # serialize to a string so it can be saved
    with open('cookies.txt', 'w') as f:
        f.write(jsonCookies)
    print('cookies saved successfully!')
    browser.close()


if __name__ == '__main__':
    get_cookie('https://passport.jd.com/new/login.aspx')
Set url to the login page link https://passport.jd.com/new/login.aspx and run this snippet first. A login page opens and you get one minute to scan the QR code and log in (you must complete the login within that minute!). When the program finishes, a cookies.txt file is created; it holds the cookie values needed for a logged-in session.
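As a quick sanity check (a small extra sketch, not part of the original code), the saved file can be loaded back to confirm it parses:

import json

# Load the cookies written by get_cookie() and confirm the file parses.
with open('cookies.txt', 'r', encoding='utf8') as f:
    cookies = json.loads(f.read())
print(len(cookies), 'cookies saved')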
3.2 Building the request URLs
Based on the analysis in 2.1, the links for the first ten pages are built as follows:
base_url = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&suggest=3.def.0.base&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=ddb6b5344d4c452496da22357a030be8&page={}'
url_list = [base_url.format(i) for i in range(1, 20, 2)]
The list url_list now holds the request URLs for the first ten pages.
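If you want to crawl a different product, the keyword parameter is just the URL-encoded search term. A minimal sketch of building such URLs for an arbitrary keyword (note: the ev/pvid parameters from the recorded URL are dropped here for simplicity, which is an assumption of this sketch rather than something the original URLs do):

from urllib.parse import quote

def build_urls(keyword, pages=10):
    # Hypothetical helper: URL-encode the keyword and map page n to the 2*n - 1 parameter.
    base = 'https://search.jd.com/Search?keyword={kw}&wq={kw}&page={p}'
    return [base.format(kw=quote(keyword), p=2 * n - 1) for n in range(1, pages + 1)]

print(build_urls('华为p50', pages=3))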
3.3 Getting the page source
def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # If the current URL matches the requested one, run a bit of JS to scroll to the
    # bottom so that all products load, then return the page source.
    if browser.current_url == url:
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
    # If they differ, a redirect happened: add the locally saved cookies and request the page again.
    else:
        with open('cookies.txt', 'r', encoding='utf8') as f:
            listCookies = json.loads(f.read())
        for cookie in listCookies:
            cookie_dict = {
                'domain': '.jd.com',
                'name': cookie.get('name'),
                'value': cookie.get('value'),
                "expires": '1629446549',
                'path': '/',
                'httpOnly': False,
                'HostOnly': False,
                'Secure': False
            }
            browser.add_cookie(cookie_dict)
        time.sleep(1)
        browser.get(url)
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
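One note on the scrolling: a single jump to scrollTop=100000 followed by a short sleep usually triggers the second batch of 30 products, but if it does not, scrolling down in several smaller steps gives the lazy loading more time to fire. A minimal sketch of that variation (the step count and pause are assumptions, not values from the original code):

def scroll_in_steps(browser, steps=6, pause=0.5):
    # Scroll down the page gradually so lazy-loaded products have time to render.
    for i in range(1, steps + 1):
        browser.execute_script(
            "window.scrollTo(0, document.body.scrollHeight * arguments[0] / arguments[1]);",
            i, steps)
        time.sleep(pause)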
3.4 Parsing the page source
I like to extract data with XPath expressions, so the page is parsed with the lxml module. The code below pulls out the product title, price, shop name, number of comments, and promotion info. Because the extracted title text is a bit messy, a process_list helper (defined in the full code in section 4) is used to clean it up.
def parser(responses):
    res = etree.HTML(responses)
    li_list = res.xpath('//*[@id="J_goodsList"]/ul/li')
    info = []
    for li in li_list:
        title = li.xpath('./div/div[4]/a/em//font/text()')[0]
        all_title = li.xpath('./div/div[4]/a/em/text()')
        all_title = title + process_list(all_title)
        price = li.xpath('./div/div[3]/strong/i/text()')[0]
        shop = li.xpath('./div/div[7]/span/a/text()')
        comment_num = li.xpath('./div/div[5]/strong/a/text()')
        discount = li.xpath('./div/div[8]/i/text()')
        print(all_title, price, shop, comment_num, discount)
        a = {'title': all_title, 'price': price, 'shop': shop, 'comment_num': comment_num, 'discount': discount}
        info.append(a)
    return info
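As a quick usage check (assuming get_html from 3.3, url_list from 3.2 and the process_list helper from section 4 are already in scope), a single page can be fetched and parsed like this:

items = parser(get_html(url_list[0]))
print(len(items), 'items parsed from the first page')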
3.5 Storing the data
The extracted information is saved to MongoDB, using a database named Jingdong and a collection named huawei P50. The documents returned by the parsing step are inserted one after another, as shown below:
def save_info_to_mongo(info):
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Jingdong'), 'huawei P50')
    for item in info:
        collection.insert_one(item)
    client.close()
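Inserting the documents one at a time works fine at this scale; as a small aside, pymongo also provides insert_many, which writes the whole list in a single call (a sketch, assuming info is a non-empty list of dicts):

if info:
    collection.insert_many(info)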
4. Complete code
import json
import time
from pymongo.database import Database
from pymongo.collection import Collection
import pymongo
from lxml import etree
from selenium import webdriver


def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    if browser.current_url == url:
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
    else:
        with open('cookies.txt', 'r', encoding='utf8') as f:
            listCookies = json.loads(f.read())
        for cookie in listCookies:
            cookie_dict = {
                'domain': '.jd.com',
                'name': cookie.get('name'),
                'value': cookie.get('value'),
                "expires": '1629446549',
                'path': '/',
                'httpOnly': False,
                'HostOnly': False,
                'Secure': False
            }
            browser.add_cookie(cookie_dict)
        time.sleep(1)
        browser.get(url)
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses


def parser(responses):
    res = etree.HTML(responses)
    li_list = res.xpath('//*[@id="J_goodsList"]/ul/li')
    info = []
    for li in li_list:
        title = li.xpath('./div/div[4]/a/em//font/text()')[0]
        all_title = li.xpath('./div/div[4]/a/em/text()')
        all_title = title + process_list(all_title)
        price = li.xpath('./div/div[3]/strong/i/text()')[0]
        shop = li.xpath('./div/div[7]/span/a/text()')
        comment_num = li.xpath('./div/div[5]/strong/a/text()')
        discount = li.xpath('./div/div[8]/i/text()')
        print(all_title, price, shop, comment_num, discount)
        a = {'title': all_title, 'price': price, 'shop': shop, 'comment_num': comment_num, 'discount': discount}
        info.append(a)
    return info


def save_info_to_mongo(info):
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Jingdong'), 'huawei P50')
    for item in info:
        collection.insert_one(item)
    client.close()


def process_list(lists):
    a = ''
    for i in lists:
        b = i.replace('\n', '').replace('【', '').replace('】', '').replace('-', '')
        a += b
    return a


if __name__ == '__main__':
    base_url = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&suggest=3.def.0.base&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=ddb6b5344d4c452496da22357a030be8&page={}'
    url_list = [base_url.format(i) for i in range(1, 20, 2)]
    for page_url in url_list:
        save_info_to_mongo(parser(get_html(page_url)))
5. Results
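To inspect what was stored, a short query against the Jingdong database works (a quick sketch; the field names are the ones the parser stores):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['Jingdong']['huawei P50']
print(collection.count_documents({}), 'documents stored')
for doc in collection.find().limit(3):
    print(doc['title'], doc['price'])
client.close()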
6. Final notes
This example is for learning and reference only; if there are any mistakes, please point them out!