[Issue 1291] An Introduction to Puppeteer

前端早读课 (Front-End Morning Reading), 2019-06-04

Preface

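There seem to be more holidays than ever these days, and each one has become a chance for companies and teams to show what they can do. Today's article was contributed by @红烧牛肉面 of Meituan.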

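@红烧牛肉面 works on the Daozong (到综) front-end team at Meituan-Dianping, has five years of front-end development experience with a focus on H5 pages, and is currently working hard at learning Node development.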

What is Puppeteer

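Put simply, Puppeteer is a Node library that provides a well-designed API for controlling headless Chrome or Chromium. It is maintained by the Google Chrome team, has good compatibility, and looks to have a solid future. Since it appeared, older headless browsers such as PhantomJS seem to have stopped being maintained.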

At the time of writing, the latest PhantomJS release is v2.1, dated 2016-01-23.

Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.

What Puppeteer can do

Most of what you can do in a browser, Puppeteer can do too.

Most things that you can do manually in the browser can be done using Puppeteer!
Here are a few examples to get you started:
Generate screenshots and PDFs of pages.
Crawl a SPA and generate pre-rendered content (i.e. “SSR”).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Capture a timeline trace of your site to help diagnose performance issues.

The most common uses are probably the page crawling and automated testing mentioned above.

How to use Puppeteer

Getting ready

npm i puppeteer

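The puppeteer npm package is fairly large, around 70 MB, and installing it downloads Chromium, which may require a proxy from mainland China. You can consider a global install with -g.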
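If the Chromium download is the obstacle, one option is to point Puppeteer at a Chrome that is already installed, using the executablePath option of puppeteer.launch. The sketch below only builds the options object. The helper name buildLaunchOptions and the environment variable CHROME_PATH are made up for illustration; Puppeteer does not read that variable itself.

```javascript
// Hypothetical helper: build the options object for puppeteer.launch().
// CHROME_PATH is an assumed variable name chosen for this sketch.
function buildLaunchOptions(env) {
  const options = { headless: true };
  if (env.CHROME_PATH) {
    // executablePath makes Puppeteer run this binary instead of the bundled Chromium.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}
```

It would be used as puppeteer.launch(buildLaunchOptions(process.env)). Keep in mind that Puppeteer is only guaranteed to work with the Chromium version it ships with, so a locally installed Chrome may behave differently.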

Puppeteer requires Node v6.4.0 or later. The examples on the official site use async/await, so Node v7.6.0 or later is the better choice.

Caution: Puppeteer requires at least Node v6.4.0, but the examples below use async/await which is only supported in Node v7.6.0 or greater.

I also recommend using as recent a Node version as you can. The examples below use async/await heavily to keep the code easy to read and follow.
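To check whether the Node you are running meets the v7.6.0 threshold quoted above, a small comparison is enough. This is only a sketch; supportsAsyncAwait is a name invented here, and it looks at the major and minor numbers only.

```javascript
// Returns true when a Node version string such as "v7.6.0" is at least v7.6.0,
// the first release the quoted note lists as supporting async/await.
function supportsAsyncAwait(version) {
  const [major, minor] = version.replace(/^v/, '').split('.').map(Number);
  return major > 7 || (major === 7 && minor >= 6);
}

// Report on the Node that is running this script.
console.log(process.version, supportsAsyncAwait(process.version));
```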

Using Puppeteer really comes down to using its API, and the Puppeteer API documentation is very well written, so let's go straight to examples.

Example 1: screenshots at several sizes

const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch({
        headless:false
    });
    const page = await browser.newPage();

    await example1(page,browser); // swap in the later examples here

    await browser.close();
})();
async function example1(page,browser){
    await page.goto('https://www.jd.com/',{
        waitUntil:'load'
    });

    /**
     * page.pdf requires launch to be called with headless: true
     */
    // await page.pdf({
    //     path: 'example1.pdf', 
    //     format: 'A4'
    // });

    // By default the screenshot covers the 800*600 viewport
    await page.screenshot({
        path: 'example1-1.png'
    });

    /**
     * setViewport redefines the viewport as 1420*1000
     */
    await page.setViewport({
        width: 1420, 
        height: 1000
    });
    await page.screenshot({
        path: 'example1-2.png'
    });

    // Full-page screenshot
    await page.screenshot({
        path: 'example1-3.png',
        fullPage: true
    });
}

puppeteer.launch creates a Browser instance, much like opening the browser.

browser.newPage creates a Page instance, much like opening a new tab in the browser.

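browser.close closes the browser once you are done with it; remember to call it so that browser processes and their memory are not left behind.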

These three methods are the bare minimum that every complete program needs.
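One gap in the launch / newPage / close sequence is that an exception thrown in between skips browser.close. A try/finally wrapper is one way to handle that. This is a sketch rather than anything Puppeteer provides: withBrowser is a name invented here, and the puppeteer module is passed in as an argument so the helper has no hidden dependency.

```javascript
// Run `task(page, browser)` and close the browser even if the task throws.
// `puppeteer` is whatever require('puppeteer') returns; it is injected by the caller.
async function withBrowser(puppeteer, launchOptions, task) {
  const browser = await puppeteer.launch(launchOptions);
  try {
    const page = await browser.newPage();
    return await task(page, browser);
  } finally {
    // Runs on success and on failure alike.
    await browser.close();
  }
}
```

With the example functions in this article it would be called as withBrowser(require('puppeteer'), { headless: false }, example1).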

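page.goto navigates to the page you want to visit. Note the waitUntil option, which controls when navigation is considered finished; after all, you cannot expect a page to be ready the instant it opens. The example waits for the page's load event. The other accepted values are "domcontentloaded", "networkidle0", and "networkidle2"; see the documentation for details.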
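A quick way to get a feel for the four waitUntil values is to time the same navigation with each of them. Per the Puppeteer documentation, networkidle0 waits until there have been no network connections for at least 500 ms, and networkidle2 until there have been no more than two. The helper below is a sketch; timeNavigation and WAIT_UNTIL are names invented for it, and page is a Puppeteer Page such as the one created in Example 1.

```javascript
// The four waitUntil values named above.
const WAIT_UNTIL = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Navigate to `url` and resolve to the elapsed time in milliseconds.
async function timeNavigation(page, url, waitUntil) {
  if (!WAIT_UNTIL.includes(waitUntil)) {
    throw new Error(`Unknown waitUntil value: ${waitUntil}`);
  }
  const started = Date.now();
  await page.goto(url, { waitUntil });
  return Date.now() - started;
}
```

For instance, console.log(await timeNavigation(page, 'https://www.jd.com/', 'networkidle2')) prints how long that strategy took on your connection.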

page.screenshot captures the default 800*600 viewport; use page.setViewport to redefine the viewport. If what you really want is the whole page, the fullPage option of page.screenshot does that.
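When there are more than two sizes, the setViewport / screenshot pair can go in a loop. A minimal sketch follows; screenshotAtSizes and its file-naming scheme are my own choices, not part of Puppeteer.

```javascript
// Take one screenshot per viewport size and resolve to the list of file paths.
// `page` is a Puppeteer Page; `sizes` is an array of { width, height } objects.
async function screenshotAtSizes(page, sizes, prefix) {
  const files = [];
  for (const { width, height } of sizes) {
    await page.setViewport({ width, height });
    const path = `${prefix}-${width}x${height}.png`;
    await page.screenshot({ path });
    files.push(path);
  }
  return files;
}
```

Calling await screenshotAtSizes(page, [{ width: 800, height: 600 }, { width: 1420, height: 1000 }], 'example1') after page.goto would write example1-800x600.png and example1-1420x1000.png.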

Example 2: lazy loading and load-on-scroll

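If you ran Example 1 successfully, you will have noticed that the fullPage screenshot of the jd home page has large blank areas. That is because the page lazy-loads images and loads more data as you scroll. The fix, naturally, is to make the page scroll.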

async function example2(page,browser){
    await page.goto('https://www.jd.com/',{
        waitUntil:'networkidle2'
    });
    await autoScroll(page);
    // Full-page screenshot
    await page.screenshot({
        path: 'example2.png',
        fullPage: true
    });
}
async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            var totalHeight = 0;
            var distance = 200;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;

                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 200);
        });
    });
}

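The core function autoScroll relies on page.evaluate, a very powerful method. It runs the function you give it in the page's context, much like the console in the browser's developer tools, so it can execute plain JavaScript, and DOM manipulation is no problem at all.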
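Two details of page.evaluate are worth seeing in code. The function runs inside the page, so it cannot see variables from the Node script; values have to be passed as extra arguments. And on a page that keeps loading more content as you scroll, the loop in autoScroll may never reach the bottom. The variant below is a sketch that addresses both; autoScrollBounded and its maxSteps cap are my additions, not part of the original example.

```javascript
// Scroll down in steps, stopping at the bottom or after maxSteps, whichever comes first.
// step, delay and maxSteps are handed to the page as serializable arguments.
async function autoScrollBounded(page, { step = 200, delay = 200, maxSteps = 100 } = {}) {
  return page.evaluate((step, delay, maxSteps) => {
    // Everything in this function runs in the page, not in Node.
    return new Promise((resolve) => {
      let scrolled = 0;
      let steps = 0;
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        scrolled += step;
        steps += 1;
        if (scrolled >= document.body.scrollHeight || steps >= maxSteps) {
          clearInterval(timer);
          resolve(steps); // the step count travels back to the Node side
        }
      }, delay);
    });
  }, step, delay, maxSteps);
}
```

In Example 2, await autoScroll(page) could be swapped for await autoScrollBounded(page, { maxSteps: 50 }).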

Example 3: emulating a mobile device

const devices = require('puppeteer/DeviceDescriptors');
const iPhone6 = devices['iPhone 6'];
async function example3(page,browser){
    // Emulate before navigating so the page loads with the phone's viewport and user agent
    //await page.emulate(iPhone6);
    await page.emulate({
        viewport: {
            width: 375,
            height: 667,
            isMobile: true
        },
        userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1'
    });
    await page.goto('https://www.jd.com/',{
        waitUntil:'networkidle2'  // default is load
    });

    await autoScroll(page);
    await page.waitFor(3*1000);
}

page.emulate is essentially a shortcut for calling page.setViewport and page.setUserAgent together.

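Use the viewport parameter to set the screen size you want. If that is too much trouble, puppeteer thoughtfully ships a DeviceDescriptors map from which you can look up the parameters for a predefined phone model.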
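To make the "shortcut" point concrete, here is roughly what a hand-written equivalent of page.emulate looks like. It is a sketch of the idea, not Puppeteer's actual source; emulateByHand and smallPhone are names invented here, and the descriptor values are the ones used in Example 3.

```javascript
// A device descriptor in the same shape Example 3 passes to page.emulate.
const smallPhone = {
  viewport: { width: 375, height: 667, isMobile: true },
  userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1'
};

// Apply the two settings separately: user agent first, then viewport.
async function emulateByHand(page, device) {
  await page.setUserAgent(device.userAgent);
  await page.setViewport(device.viewport);
}
```

As with page.emulate, it is best called before page.goto so the site is served and laid out for the phone from the start.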

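Passing page.waitFor a number of milliseconds pauses the script for that long, which is handy when you want to watch the page during an automated test. waitFor can do more than that; the documentation describes its other forms, and later examples touch on them.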
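A fixed pause is simple but brittle: too short and the element is not there yet, too long and the script wastes time. Waiting for a condition is usually sturdier. The sketch below uses page.waitForSelector, which resolves once an element matching the selector exists and throws when the timeout passes. As far as I know, later Puppeteer releases deprecated the catch-all page.waitFor in favor of specific methods like this one. textWhenPresent is a name invented for the example.

```javascript
// Wait until `selector` is present, then resolve to its text.
// Resolves to null if waitForSelector fails, which this sketch treats as
// "the element did not show up within `timeout` milliseconds".
async function textWhenPresent(page, selector, timeout = 5000) {
  try {
    await page.waitForSelector(selector, { timeout });
  } catch (err) {
    return null;
  }
  return page.$eval(selector, (el) => el.innerText);
}
```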

Example 4: scraping a page and automating page actions

async function example4(page,browser){
    await page.goto('https://www.jd.com/',{
        waitUntil:'networkidle2'  // default is load
    });
    await page.setViewport({
        width: 1185,
        height: 800
    });

    /**
     * Reading an element's content, option 1
     */
    const result = await page.$eval('#hotwords',el=>{
        return el.innerText;
    });
    console.log('page.$eval',result);

    /**
     * Reading an element's content, option 2
     */
    // const result = await page.evaluate(()=>{
    //     let element = document.querySelector('#hotwords');
    //     return element.innerText;
    // })
    //console.log('page.evaluate',result)

    await page.waitFor(3*1000);

    /**
     * Typing into an element, option 1
     */
    // await page.type('#search input', 'macbook pro', {
    //     delay: 300
    // });
    // await page.keyboard.press('Enter');

    /**
     * Typing into an element, option 2
     */
    const inputEle = await page.$('#search input');
    await inputEle.type('macbook pro',{
        delay:300
    });
    // await inputEle.screenshot({
    //     path: 'example5.png'
    // });
    //await inputEle.press('Enter');

    /**
     * Clicking an element, option 1
     */
    //await page.click('#search button');

    /**
     * Clicking an element, option 2
     */
    const btnEle = await page.$('#search button');
    await btnEle.click();

    await page.waitFor(5*1000);
}

page.evaluate is powerful, as noted earlier, but it is overkill if all you want is one element's text. Use page.$eval instead: it passes the result of document.querySelector to your callback as its argument.

The documentation also describes the similar page.$$eval, which passes the elements matched by document.querySelectorAll to your callback as an array.
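Since Example 4 only uses page.$eval, here is a short page.$$eval sketch for comparison. collectTexts is a name invented here. The callback runs in the page and must return something serializable, which is why it maps the elements to plain strings.

```javascript
// Collect the trimmed text of every element matching `selector`.
// The callback receives an array of DOM elements and returns an array of strings.
async function collectTexts(page, selector) {
  return page.$$eval(selector, (els) => els.map((el) => el.innerText.trim()));
}
```

On the jd page something like await collectTexts(page, '#hotwords a') might list the individual hot words, assuming they are rendered as links inside #hotwords; I have not checked the current markup.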

page.$ also calls document.querySelector, but takes no callback; it resolves directly to an ElementHandle wrapping the element, or to null if nothing matches.

Likewise, page.$$ calls document.querySelectorAll and resolves to an array of ElementHandle objects.

ElementHandle provides the type, click, and press methods used above. screenshot deserves a special mention: it captures only the region occupied by the ElementHandle.

elementHandle.screenshot([options])
This method scrolls element into view if needed, and then uses page.screenshot to take a screenshot of the element. If the element is detached from DOM, the method throws an error.
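Combining page.$$ with elementHandle.screenshot gives one image per matched element. The following is a sketch; screenshotEach and the file-naming scheme are my own.

```javascript
// Screenshot each element matched by `selector` into its own file.
// page.$$ resolves to an array of ElementHandle objects (empty if nothing matches).
async function screenshotEach(page, selector, prefix) {
  const handles = await page.$$(selector);
  const files = [];
  for (let i = 0; i < handles.length; i++) {
    const path = `${prefix}-${i}.png`;
    await handles[i].screenshot({ path });
    files.push(path);
  }
  return files;
}
```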

Good kids ask questions

How do I press a key combination, for example to type an uppercase letter?

Typing an uppercase M:

await page.focus('input');
await page.keyboard.down('Shift');
await page.keyboard.press('KeyM');
await page.keyboard.up('Shift');
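The same down / press / up pattern generalizes to any combination. The helper below is a sketch; pressCombo is a name invented here. It releases the modifiers in a finally block so that a failed press does not leave Shift or Control held down for the rest of the script.

```javascript
// Hold the modifier keys, press `key`, then release the modifiers in reverse order.
// `page` is a Puppeteer Page; key names follow Puppeteer's keyboard layout, e.g. 'Shift', 'KeyM'.
async function pressCombo(page, modifiers, key) {
  for (const mod of modifiers) {
    await page.keyboard.down(mod);
  }
  try {
    await page.keyboard.press(key);
  } finally {
    for (const mod of modifiers.slice().reverse()) {
      await page.keyboard.up(mod);
    }
  }
}
```

For example, await pressCombo(page, ['Control'], 'KeyA') sends Ctrl+A to the focused element.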

How do I handle a navigation to a new page, as when clicking an a tag?

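There are two cases.

If no new page is opened, as with an a tag whose target is not _blank, there is nothing to worry about: page keeps pointing at the current document. To get back to the previous page, use page.goBack, page.goForward, or page.goto.
If a new page is opened, as with an a tag whose target is _blank, call let pages = await browser.pages(). It resolves to an array of the Page objects currently open, and you pick the one you want from it. The original page variable still points at the original page.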
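For the second case, picking the right entry out of browser.pages() can be awkward. An alternative is to listen for the browser's targetcreated event around the click and ask the new target for its page. This is a sketch under one assumption: the first target created after the click is the new tab. clickAndWaitForNewPage is a name invented here.

```javascript
// Click `selector` on `page` and resolve to the Page that the click opens.
// Assumes the first 'targetcreated' event after the click belongs to the new tab.
async function clickAndWaitForNewPage(browser, page, selector) {
  // Register the listener before clicking so the event cannot be missed.
  const newTarget = new Promise((resolve) => browser.once('targetcreated', resolve));
  await page.click(selector);
  const target = await newTarget;
  return target.page();
}
```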

The article opened by saying Puppeteer can do most things, which means there are things it cannot do

Puppeteer only supports Chromium or Chrome.
Puppeteer's support for audio and video is limited; see the official answer in the Puppeteer documentation for the reason.

Closing thoughts

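Puppeteer is quite convenient to use and its API is powerful. This article covers only a small part of what it can do; for the rest, I recommend reading the official documentation.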

About this article
Author: @红烧牛肉面
Original: https://github.com/masterkong/blog/issues/6
