[Stones from Other Hills] Building a Knowledge Graph from Scratch in Python
"Stones from other hills may serve to polish jade": only by standing on the shoulders of giants can we see higher and go further, and on the road of research a tailwind helps us move faster. To that end, we have collected practical code links, datasets, software, and programming tips in this "Stones from Other Hills" column, to help you ride the wind and waves and press ahead. Stay tuned.
Link: https://www.zhihu.com/people/wxj630
- Knowledge graphs are one of the most fascinating concepts in data science.
- Learn how to build a knowledge graph to mine information from Wikipedia pages.
- You will get hands-on experience building a knowledge graph in Python with the popular spaCy library.
01
1. What Is a Knowledge Graph
2. Sentence Segmentation
Indian tennis player Sumit Nagal moved up six places from 135 to a career-best 129 in the latest men's singles rankings. The 22-year-old recently won the ATP Challenger tournament. He made his Grand Slam debut against Federer in the 2019 US Open. Nagal won the first set.
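The first step is splitting a paragraph like this into individual sentences. As a minimal sketch that avoids downloading a pretrained model, a blank spaCy English pipeline with the rule-based `sentencizer` component splits the text above on sentence-final punctuation:

```python
import spacy

# Blank pipeline + rule-based sentencizer: no model download required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("Indian tennis player Sumit Nagal moved up six places from 135 "
        "to a career-best 129 in the latest men's singles rankings. "
        "The 22-year-old recently won the ATP Challenger tournament. "
        "He made his Grand Slam debut against Federer in the 2019 US Open. "
        "Nagal won the first set.")

sentences = [sent.text for sent in nlp(text).sents]
for s in sentences:
    print(s)
```

The full pipeline later in this post uses `en_core_web_sm` instead, whose parser handles trickier cases (abbreviations, decimals) better than this rule-based stand-in.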
3. Entity Recognition
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("The 22-year-old recently won ATP Challenger tournament.")
for tok in doc:
    print(tok.text, "...", tok.dep_)
'''
Output:
The ... det
22-year ... amod
- ... punct
old ... nsubj
recently ... advmod
won ... ROOT
ATP ... compound
Challenger ... compound
tournament ... dobj
. ... punct
'''
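Note that the entities here come from the dependency parse rather than spaCy's named-entity recognizer: the sentence's subject and object become the graph's nodes. Given the (token, dependency tag) pairs printed above, picking them out is a one-liner each; a sketch over that hard-coded parse output:

```python
# (token, dependency tag) pairs copied from the parse output above.
parsed = [("The", "det"), ("22-year", "amod"), ("-", "punct"),
          ("old", "nsubj"), ("recently", "advmod"), ("won", "ROOT"),
          ("ATP", "compound"), ("Challenger", "compound"),
          ("tournament", "dobj"), (".", "punct")]

# Subject/object tags contain "subj"/"obj" (nsubj, dobj, pobj, ...).
subject = next(tok for tok, dep in parsed if "subj" in dep)
obj = next(tok for tok, dep in parsed if "obj" in dep)

print(subject, "->", obj)  # prints the bare heads: old -> tournament
```

The bare heads alone are not very informative ("old", "tournament"), which is exactly why the `get_entities` function in this post also folds `compound` and modifier tokens into each entity.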
4. Relation Extraction
doc = nlp("Nagal won the first set.")
for tok in doc:
    print(tok.text, "...", tok.dep_)
'''
Output:
Nagal ... nsubj
won ... ROOT
the ... det
first ... amod
set ... dobj
. ... punct
'''
02
1. Import Libraries
import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm
pd.set_option('display.max_colwidth', 200)
%matplotlib inline
2. Read the Data
# import wikipedia sentences
candidate_sentences = pd.read_csv("wiki_sentences_v2.csv")
candidate_sentences.shape
'''
(4318, 1)
'''
candidate_sentences['sentence'].sample(5)
doc = nlp("the drawdown process is governed by astm standard d823")
for tok in doc:
    print(tok.text, "...", tok.dep_)
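The `wiki_sentences_v2.csv` file ships with the original Analytics Vidhya tutorial. If you don't have it, any one-column DataFrame of short lowercase sentences works for following along; a stand-in built from a few hypothetical sentences:

```python
import pandas as pd

# Hypothetical stand-in for wiki_sentences_v2.csv: one "sentence" column.
candidate_sentences = pd.DataFrame({"sentence": [
    "the film had 200 patents",
    "the soundtrack was composed by john williams",
    "the screenplay was written by aaron sorkin",
]})

print(candidate_sentences.shape)  # (3, 1)
```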
3. Entity Pair Extraction (Subject-Object Pairs)
def get_entities(sent):
    ## chunk 1
    # Empty placeholders: prv_tok_dep and prv_tok_text hold the dependency
    # tag and the text of the previous token in the sentence; prefix and
    # modifier collect text attached to the subject or object.
    ent1 = ""
    ent2 = ""
    prv_tok_dep = ""   # dependency tag of previous token in the sentence
    prv_tok_text = ""  # previous token in the sentence
    prefix = ""
    modifier = ""

    #############################################################
    for tok in nlp(sent):
        ## chunk 2
        # Loop over the tokens, skipping punctuation. If a token is part of
        # a compound word (dependency tag "compound"), save it in prefix.
        # Compound words combine several words into a unit with a new
        # meaning (e.g. "Football Stadium", "animal lover"); the prefix is
        # attached to the subject or object when we reach it. Modifiers
        # such as "nice shirt" or "big house" are handled the same way.
        # if token is a punctuation mark then move on to the next token
        if tok.dep_ != "punct":
            # check: token is a compound word or not
            if tok.dep_ == "compound":
                prefix = tok.text
                # if the previous word was also a compound, join them
                if prv_tok_dep == "compound":
                    prefix = prv_tok_text + " " + tok.text
            # check: token is a modifier or not
            if tok.dep_.endswith("mod"):
                modifier = tok.text
                # if the previous word was a compound, join them
                if prv_tok_dep == "compound":
                    modifier = prv_tok_text + " " + tok.text
            ## chunk 3
            # If the token is the subject, capture it (with its modifier
            # and prefix) as the first entity in ent1, then reset the
            # helper variables prefix, modifier, prv_tok_dep, prv_tok_text.
            if "subj" in tok.dep_:
                ent1 = modifier + " " + prefix + " " + tok.text
                prefix = ""
                modifier = ""
                prv_tok_dep = ""
                prv_tok_text = ""
            ## chunk 4
            # If the token is the object, capture it as the second entity.
            if "obj" in tok.dep_:
                ent2 = modifier + " " + prefix + " " + tok.text
            ## chunk 5
            # Update the previous token and its dependency tag.
            prv_tok_dep = tok.dep_
            prv_tok_text = tok.text
    #############################################################
    return [ent1.strip(), ent2.strip()]
get_entities("the film had 200 patents")
'''
Output: ['film', '200 patents']
'''
entity_pairs = []
for i in tqdm(candidate_sentences["sentence"]):
    entity_pairs.append(get_entities(i))

entity_pairs[10:20]
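`get_entities` is a heuristic: it returns at most one subject-object pair per sentence and leaves an empty string when no subject or no object was found. Before building the graph it can help to drop the incomplete pairs; a sketch with hypothetical extracted pairs:

```python
# Hypothetical entity_pairs as produced by get_entities: some incomplete.
entity_pairs = [["film", "200 patents"],
                ["", "tournament"],   # no subject found
                ["nagal", ""]]        # no object found

# Keep only pairs where both the subject and the object are non-empty.
clean_pairs = [pair for pair in entity_pairs if pair[0] and pair[1]]
print(clean_pairs)  # [['film', '200 patents']]
```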
4. Relation / Predicate Extraction
def get_relation(sent):
    doc = nlp(sent)

    # Matcher class object
    matcher = Matcher(nlp.vocab)

    # define the pattern: the ROOT verb, optionally followed by a
    # preposition, an agent, or an adjective
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"},
               {'POS': 'ADJ', 'OP': "?"}]

    # spaCy v3 API; in v2 this was matcher.add("matching_1", None, pattern)
    matcher.add("matching_1", [pattern])

    matches = matcher(doc)
    k = len(matches) - 1
    span = doc[matches[k][1]:matches[k][2]]

    return span.text
get_relation("John completed the task")
'''
Output: completed
'''
relations = [get_relation(i) for i in tqdm(candidate_sentences['sentence'])]
pd.Series(relations).value_counts()[:50]
5. Build the Knowledge Graph
# extract subject
source = [i[0] for i in entity_pairs]
# extract object
target = [i[1] for i in entity_pairs]
kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
# create a directed-graph from a dataframe
G=nx.from_pandas_edgelist(kg_df, "source", "target",
edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="composed by"], "source", "target",
edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5) # k regulates the distance between nodes
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
G=nx.from_pandas_edgelist(kg_df[kg_df['edge']=="written by"], "source", "target",
edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k = 0.5)
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
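Beyond plotting, the `networkx` graph can be queried directly, for example to count nodes or look up the relation stored on an edge. A sketch using a few hypothetical triples in the same `source`/`target`/`edge` layout as `kg_df`:

```python
import pandas as pd
import networkx as nx

# Hypothetical triples in the same layout as kg_df above.
kg_df = pd.DataFrame({
    "source": ["nagal", "film", "soundtrack"],
    "target": ["first set", "200 patents", "john williams"],
    "edge":   ["won", "had", "composed by"],
})

G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

print(G.number_of_nodes(), G.number_of_edges())     # 6 3
# MultiDiGraph edges are keyed; key 0 is the first edge between the nodes.
print(G["soundtrack"]["john williams"][0]["edge"])  # composed by
```

Because `edge_attr=True` copies every remaining DataFrame column onto the edges, the predicate survives the round trip from DataFrame to graph and back.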
03
[1] Knowledge Graph – A Powerful Data Science Technique to Mine Information from Text (with Python code):
https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/
[2] spaCy documentation:
https://github.com/explosion/spaCy
[3] spaCy tutorial in Chinese:
https://www.jianshu.com/p/e6b3565e159d