Tools Used

python3.8
selenium (make sure you have successfully installed Google's chromedriver)
MongoDB
MongoDB Compass
Google Chrome

Analyzing the Request URLs

Open the Taobao homepage at https://www.taobao.com/.
Taking the product "ipad" as an example, type ipad into the search box and click search.
Now copy the links of the first four result pages and look for a pattern:

Page 1: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.search.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979
Page 2: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=1
Page 3: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=2
Page 4: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=3

It is easy to see that, apart from the slightly different first-page link, the other pages' URLs are essentially identical; the only difference is the trailing pnum value, which is 1, 2, and 3 respectively. You can also verify that setting pnum to 0 opens the first page of results. Based on this pattern we can build the URLs for the first ten pages:

base_url = 'https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum={}'
url_list = [base_url.format(i) for i in range(10)]
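As a quick sanity check, pure standard-library code can confirm that each generated URL carries the expected pnum value. The base URL here is shortened for readability; the real one carries the extra tracking parameters shown above:

```python
from urllib.parse import urlparse, parse_qs

# Simplified base URL; the real link includes additional tracking parameters.
base_url = 'https://uland.taobao.com/sem/tbsearch?keyword=ipad&pnum={}'
url_list = [base_url.format(i) for i in range(10)]

# Parse each URL back and collect its pnum query value.
pnums = [parse_qs(urlparse(u).query)['pnum'][0] for u in url_list]
print(pnums)  # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```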

Fetching the Page Source

Fetching the page directly with requests does not return all of the data, so selenium is used here to obtain the fully rendered page source. The code is as follows:

def get_html(url):
    # Launch a Chrome instance via chromedriver and load the page.
    browser = webdriver.Chrome()
    browser.get(url)
    # page_source holds the DOM after JavaScript rendering.
    response = browser.page_source
    # quit() also terminates the chromedriver process; close() only closes the window.
    browser.quit()
    return response

Parsing the Data

The data can be parsed with the lxml module. Since XPath expressions can be copied straight from Chrome DevTools, this approach is quite simple. Here I extract four fields: product title, shop name, price, and sales volume.
The code is as follows:

def parser(response):
    html = etree.HTML(response)
    # Each search result is an <li> under the container with id "mx_5".
    li_list = html.xpath('//*[@id="mx_5"]/ul/li')
    ipad_info = []
    for li in li_list:
        # [0] takes the first matched text node; raises IndexError if the node is missing.
        title = li.xpath('./a/div[1]/span/text()')[0]
        price = li.xpath('./a/div[2]/span[2]/text()')[0]
        shop = li.xpath('./a/div[3]/div/text()')[0]
        sales = li.xpath('./a/div[4]/div[2]/text()')[0]
        ipad_info.append({'title': title, 'price': price, 'shop': shop, 'sales': sales})
    return ipad_info
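One fragility in the parser above: each `li.xpath(...)[0]` raises IndexError as soon as a node is missing, which happens when Taobao changes its layout or inserts ad cards. A tiny helper (a sketch, not part of the original script) returns a default value instead:

```python
def first(items, default=''):
    """Return the first element of an XPath result list, or a default if the list is empty."""
    return items[0] if items else default

# Inside the loop it would be used like:
#   title = first(li.xpath('./a/div[1]/span/text()'))
print(first(['iPad 9th Gen']))  # iPad 9th Gen
print(first([], 'N/A'))         # N/A
```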

Storing the Data

Save the parsed data to MongoDB. Here we create a database named Taobao with a collection named ipad_info:

def save_info_to_mongo(ipad_info):
    # Connect to the local MongoDB instance on the default port.
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Taobao'), 'ipad_info')
    # Insert one document per scraped item.
    for info in ipad_info:
        collection.insert_one(info)
    client.close()
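If MongoDB is not available, the same records can be appended to a JSON Lines file with the standard library alone. This sketch is an alternative to save_info_to_mongo, not part of the original script; the path used below is just an illustration:

```python
import json
import os
import tempfile

def save_info_to_jsonl(ipad_info, path):
    # One JSON object per line; ensure_ascii=False keeps Chinese titles readable.
    with open(path, 'a', encoding='utf-8') as f:
        for info in ipad_info:
            f.write(json.dumps(info, ensure_ascii=False) + '\n')

# Example usage with a file in the system temp directory:
path = os.path.join(tempfile.gettempdir(), 'ipad_info.jsonl')
save_info_to_jsonl([{'title': 'iPad', 'price': '2499', 'shop': 'Apple', 'sales': '1000+'}], path)
```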

Full Code

import pymongo
from lxml import etree
from selenium import webdriver
from pymongo.collection import Collection
from pymongo.database import Database


def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    response = browser.page_source
    browser.quit()  # quit() also terminates the chromedriver process
    return response


def parser(response):
    html = etree.HTML(response)
    li_list = html.xpath('//*[@id="mx_5"]/ul/li')
    ipad_info = []
    for li in li_list:
        title = li.xpath('./a/div[1]/span/text()')[0]
        price = li.xpath('./a/div[2]/span[2]/text()')[0]
        shop = li.xpath('./a/div[3]/div/text()')[0]
        sales = li.xpath('./a/div[4]/div[2]/text()')[0]
        ipad_info.append({'title': title, 'price': price, 'shop': shop, 'sales': sales})
    return ipad_info


def save_info_to_mongo(ipad_info):
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Taobao'), 'ipad_info')
    for info in ipad_info:
        collection.insert_one(info)
    client.close()


if __name__ == '__main__':
    base_url = 'https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum={}'
    url_list = [base_url.format(i) for i in range(10)]
    print(url_list)
    for url in url_list:
        ipad_info = parser(get_html(url))
        save_info_to_mongo(ipad_info)

Results

I only scraped the first ten pages here, at 60 items per page, for a total of 600 records.

Final Notes

This example is for reference and learning only; if you spot any mistakes, please point them out!
