Web Scraping in R: Scraping NCBI Literature

Original JunJunLab 老俊俊的生信笔记 2022-08-15

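Background

A few days ago a follower messaged me to ask whether literature on NCBI can be scraped. Honestly, I don't think it is necessary: you won't necessarily read what you download in bulk. At most, scrape the abstract of each article, and download the full text for close reading only if it interests you. Still, since the question came up and I am learning web scraping anyway, I gave it a try. After an evening's work, this is what I ended up with.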

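A first look

Today we try to scrape and download literature from NCBI. PubMed articles can be downloaded from several sources:

1. Articles that are free on PubMed carry a PMC full text label

2. Free articles can be downloaded directly from the journal website

3. Some articles carry no Free (PMC) article label on the search results page and only link to the journal; whether they are free to download is only clear after clicking through

After clicking into the second article: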

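Trying the download

My plan: for articles with a free PMC full text, download the PDF; for articles that only link to the journal website, just output that link, so anything that was not downloaded can be checked by hand.

1. Get the article titles and URLs

First load the R packages and build the search URL, then get the title and link of each article. We again search for m6A-related literature: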

library(rvest)
library(tidyverse)

# Build the search URL (page 1 only; a loop here would overwrite u on every pass)
page <- 1
u <- paste0('https://pubmed.ncbi.nlm.nih.gov/?term=m6a&page=', page)
u
[1] "https://pubmed.ncbi.nlm.nih.gov/?term=m6a&page=1"

Here we use the 10 articles on the first page as a test.
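If more than one results page is needed later, the URLs can be built in one vectorized call. A minimal sketch, assuming the same `term` and `page` query parameters as above; `build_search_urls` is a hypothetical helper name, not part of rvest:

```r
# Hypothetical helper: build PubMed search URLs for several result pages at once
build_search_urls <- function(term, pages) {
  # URLencode() escapes characters such as spaces in the search term
  query <- utils::URLencode(term, reserved = TRUE)
  paste0('https://pubmed.ncbi.nlm.nih.gov/?term=', query, '&page=', pages)
}

urls <- build_search_urls('m6a', 1:3)
urls[1]
# [1] "https://pubmed.ncbi.nlm.nih.gov/?term=m6a&page=1"
```

Each element of `urls` can then go through the same steps that follow for `u`.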

Get all article titles:

# Get the article titles
title <- read_html(u) %>%
  html_nodes('.docsum-title') %>%
  html_text(trim = TRUE)
# Inspect the result
head(title,3)
[1] "Link Between m6A Modification and Cancers."
[2] "The role of m6A RNA methylation in cancer."
[3] "m6A Modification and Implications for microRNAs."

Get the link to each article's page:

# Get the link of each article
artc_link <- read_html(u) %>%
  html_nodes('.docsum-title') %>%
  html_attr('href') %>%
  paste('https://pubmed.ncbi.nlm.nih.gov',.,sep = '')
# Inspect the result
head(artc_link,3)
[1] "https://pubmed.ncbi.nlm.nih.gov/30062093/" "https://pubmed.ncbi.nlm.nih.gov/30784918/"
[3] "https://pubmed.ncbi.nlm.nih.gov/28494721/"
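The two steps above each call `read_html(u)`, so the search page is requested twice. A minimal sketch of parsing once and reusing the nodes, run here on a made-up HTML snippet that only imitates the `.docsum-title` links (the real PubMed markup has more around it):

```r
library(rvest)

# Toy HTML imitating two PubMed search results (made up for illustration)
toy_html <- '<html><body>
  <a class="docsum-title" href="/30062093/"> Link Between m6A Modification and Cancers. </a>
  <a class="docsum-title" href="/30784918/"> The role of m6A RNA methylation in cancer. </a>
</body></html>'

# Parse once, then reuse the same nodes for both titles and links
nodes <- read_html(toy_html) %>% html_nodes('.docsum-title')

results <- data.frame(
  title = html_text(nodes, trim = TRUE),
  link  = paste0('https://pubmed.ncbi.nlm.nih.gov', html_attr(nodes, 'href')),
  stringsAsFactors = FALSE
)
results
```

With the real page, `read_html(u)` would take the place of `read_html(toy_html)`.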

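Next we need to determine whether the download links on the right side of each article page contain the string Free PMC article. Open an article, right-click and choose Inspect, and look at the relevant nodes:

Both links carry the class link-item, so let's try extracting the text of the elements with that class: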

artc_link[1] %>%
  read_html() %>%
  html_nodes('.link-item') %>%
  html_text(trim = T)
[1] "Frontiers Media SA" "Free PMC article"   "Frontiers Media SA" "Free PMC article"
[5] "Clipboard"          "Email"              "Save"               "My Bibliography"
[9] "Collections"        "Citation Manager"

The string Free PMC article is there. Now let's try the second article, which has no PMC full text link:

artc_link[2] %>%
  read_html() %>%
  html_nodes('.link-item') %>%
  html_text(trim = T)
[1] "Elsevier Science" "Elsevier Science" "Clipboard"        "Email"            "Save"
[6] "My Bibliography"  "Collections"      "Citation Manager"

This time there is no Free PMC article string, so we should be able to use it to decide whether an article is free in PMC.
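The check itself is a plain membership test. A minimal sketch, where `is_free_pmc` is a hypothetical helper name and the label vectors are abridged copies of the two outputs above:

```r
# Hypothetical helper: TRUE when the link labels of an article page
# include the "Free PMC article" marker
is_free_pmc <- function(labels) {
  'Free PMC article' %in% labels
}

# Abridged label vectors from the two articles above
labels_1 <- c('Frontiers Media SA', 'Free PMC article', 'Clipboard', 'Email')
labels_2 <- c('Elsevier Science', 'Clipboard', 'Email')

is_free_pmc(labels_1)  # TRUE
is_free_pmc(labels_2)  # FALSE
```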

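Clicking PMC full text leads to a new page that contains the PDF download link:

The address of that page can be taken from an href on the previous page, which we extract with an XPath expression. The URL differs slightly from the one shown in the address bar, but both lead to the same page:

Extract the link: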

artc_link[1] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="full-text-links-dialog"]/div/a[2]') %>%
  html_attr('href')
[1] "https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/30062093/"

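2. Get the PDF download URL

Next we need the address of the PDF itself. Clicking the PDF link on the right opens the article PDF directly in the browser:

Again, this address can be taken from an href on the previous page, and we extract it with an XPath expression.

Extract the address: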

artc_link[1] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="full-text-links-dialog"]/div/a[2]') %>%
  html_attr('href') %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="rightcolumn"]/div[2]/div/ul/li[4]/a') %>%
  html_attr('href')
[1] "/pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf"

Only the path part of the address is extracted, so we still need to add the host as a prefix:

artc_link[1] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="full-text-links-dialog"]/div/a[2]') %>%
  html_attr('href') %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="rightcolumn"]/div[2]/div/ul/li[4]/a') %>%
  html_attr('href') %>%
  paste('https://www.ncbi.nlm.nih.gov/',.,sep = '')
[1] "https://www.ncbi.nlm.nih.gov//pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf"
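The pasted URL contains a double slash after the host because the prefix ends with `/` and the path starts with `/`. It still works for the downloads in this post, but if a clean URL is preferred, the join can be normalized. A minimal sketch; `join_url` is a hypothetical helper, not an rvest function:

```r
# Hypothetical helper: join a base URL and a path with exactly one slash between them
join_url <- function(base, path) {
  base <- sub('/+$', '', base)   # drop trailing slashes from the base
  path <- sub('^/+', '', path)   # drop leading slashes from the path
  paste0(base, '/', path)
}

join_url('https://www.ncbi.nlm.nih.gov/',
         '/pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf')
# [1] "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf"
```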

Next, add a condition and collect the PDF download links in bulk, following the plan described above:

# Collect the download link for every article
down_link <- list()
for (i in seq_along(artc_link)) {
  # Parse the article page once and reuse it below
  page <- read_html(artc_link[i])
  origin <- page %>%
    html_nodes('.link-item') %>%
    html_text(trim = TRUE)
  # Check whether the article is free in PMC
  if("Free PMC article" %in% origin){
    download_link <- page %>%
      html_nodes(xpath = '//*[@id="full-text-links-dialog"]/div/a[2]') %>%
      html_attr('href') %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="rightcolumn"]/div[2]/div/ul/li[4]/a') %>%
      html_attr('href') %>%
      paste('https://www.ncbi.nlm.nih.gov/',.,sep = '')
    # Store the PDF link
    down_link[[artc_link[i]]] <- download_link
  }else{
    # Otherwise take the link to the article's source (the journal site)
    web_link <- page %>%
      html_nodes('.link-item') %>%
      html_attr('href') %>%
      head(2) %>%
      unique()
    # Store it
    down_link[[artc_link[i]]] <- web_link[1]
  }
}

# Check the length
length(down_link)
[1] 10
# Inspect the result
head(down_link,3)
$`https://pubmed.ncbi.nlm.nih.gov/30062093/`
[1] "https://www.ncbi.nlm.nih.gov//pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf"

$`https://pubmed.ncbi.nlm.nih.gov/30784918/`
[1] "https://linkinghub.elsevier.com/retrieve/pii/S0753-3322(18)38294-5"

$`https://pubmed.ncbi.nlm.nih.gov/28494721/`
[1] "http://www.eurekaselect.com/152355/article"
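The loop above stops at the first request that fails, which becomes more likely as the number of pages grows. A minimal sketch of a retry wrapper with a pause between attempts, assuming any fetch function that takes a URL (for example `read_html`); `fetch_with_retry` and `make_flaky_fetch` are hypothetical names, and the fake fetcher exists only so the example runs offline:

```r
# Hypothetical wrapper: call fetch(url), retrying on error with a pause in between
fetch_with_retry <- function(url, fetch, tries = 3, delay = 1) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < tries) Sys.sleep(delay)  # wait before the next attempt
  }
  NULL  # every attempt failed
}

# Fake fetcher for an offline demonstration: fails `failures` times, then succeeds
make_flaky_fetch <- function(failures) {
  calls <- 0
  function(url) {
    calls <<- calls + 1
    if (calls <= failures) stop('temporary failure')
    paste('fetched', url)
  }
}

fetch_with_retry('https://pubmed.ncbi.nlm.nih.gov/30062093/',
                 make_flaky_fetch(2), delay = 0)
```

In the loop, `fetch_with_retry(artc_link[i], read_html)` could stand in for `read_html(artc_link[i])`, with a `NULL` result meaning the article should be skipped.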

Then store the article titles, article links and PDF download links in a data frame:

# Convert to a data frame
info <- data.frame(title_name = title,
                   arcticle_link = artc_link,
                   download_link = down_link %>% Reduce(rbind,.))

Let's take a look:

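3. Download the article PDFs

Now that we have the download links, we can batch-download the PDFs to disk. Only the PMC free ones, of course. Let's go:

# Create the output folder
dir.create('C:/Users/admin/Desktop/pdf_download/')

# Keep only the articles that can be downloaded from PMC
get_file <- info[c(grep('^https://www.ncbi.nlm.nih.gov/',info$download_link)),]

# File names
pdf_name <- get_file$title_name

# PDF download URLs
pdf_link <- get_file$download_link

# Batch download
for (i in seq_along(pdf_link)) {
  # Destination file (the titles here already end with '.', so appending 'pdf' yields '.pdf')
  p_name <- paste('C:/Users/admin/Desktop/pdf_download/',pdf_name[i],'pdf',sep = '')
  # Download
  download.file(url = pdf_link[i],
                destfile = p_name,
                mode = 'wb')
}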
trying URL 'https://www.ncbi.nlm.nih.gov//pmc/articles/PMC6055048/pdf/fbioe-06-00089.pdf'
Content type 'application/pdf' length 1918139 bytes (1.8 MB)
downloaded 1.8 MB

trying URL 'https://www.ncbi.nlm.nih.gov//pmc/articles/PMC7047367/pdf/12943_2020_Article_1172.pdf'
Content type 'application/pdf' length 776866 bytes (758 KB)
downloaded 758 KB
...
Error in download.file(url = pdf_link[i], destfile = p_name, mode = "wb") :
  cannot open destfile 'C:/Users/admin/Desktop/pdf_download/Reduced m6A modification predicts malignant phenotypes and augmented Wnt/PI3K-Akt signaling in gastric cancer.pdf', reason 'No such file or directory'

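The download failed on the last article because its title contains a slash (/). The slash is read as a path separator, so R looks for a folder that does not exist, and the title cannot be used as the file name. Here I use 4 as the file name instead.

Article title:

Download this article again: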

for (i in 4) {
  # Destination file
  p_name <- paste('C:/Users/admin/Desktop/pdf_download/',4,'.pdf',sep = '')
  # Download
  download.file(url = pdf_link[i],
                destfile = p_name,
                mode = 'wb')
}
trying URL 'https://www.ncbi.nlm.nih.gov//pmc/articles/PMC7293016/pdf/gkaa347.pdf'
Content type 'application/pdf' length 838520 bytes (818 KB)
downloaded 818 KB
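Renaming by hand is fine for one file, but with more articles the titles could be cleaned before they are used as file names. A minimal sketch using base R only; `safe_file_name` is a hypothetical helper, and the set of replaced characters is my assumption of what Windows and Unix file systems commonly reject:

```r
# Hypothetical helper: turn an article title into a usable file name
safe_file_name <- function(title, ext = 'pdf') {
  # Replace characters that commonly break file names: \ / : * ? " < > |
  stem <- gsub('[\\\\/:*?"<>|]', '_', title)
  # Drop trailing dots and spaces, then append the extension
  stem <- sub('[. ]+$', '', stem)
  paste0(stem, '.', ext)
}

safe_file_name('Reduced m6A modification predicts malignant phenotypes and augmented Wnt/PI3K-Akt signaling in gastric cancer.')
# [1] "Reduced m6A modification predicts malignant phenotypes and augmented Wnt_PI3K-Akt signaling in gastric cancer.pdf"
```

The result can be combined with the folder through `file.path()` and passed to `download.file()` as `destfile`.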

That's the job done. Now let's open a downloaded PDF to check it:

Article content:

Finally, save the data frame to disk:

# Save the output
write.csv(info,file = 'C:/Users/admin/Desktop/pdf_download/meta.csv',row.names = FALSE)

So, did you learn something today?

That's all for today. Stay tuned for the next post!
