How to Do Machine Learning with Neo4j and Scikit-Learn | A Detailed Step-by-Step Tutorial
The Neo4j Database
Neo4j Desktop download:
https://neo4j.com/download/
Read the Graph Algorithms documentation online:
https://neo4j.com/docs/graph-algorithms/current/
Link Prediction Algorithms
(1) What Is Link Prediction?
Paper:
https://www.cs.cornell.edu/home/kleinber/link-pred.pdf
Given a snapshot of a social network, can we predict which new relationships are likely to form between its members in the future? We can frame this as a link prediction problem: by analyzing the similarity of nodes in the network, we can derive methods for predicting future links.
Talk video:
https://youtu.be/kVHdMD-XT9s
(2) Link Prediction Algorithms
(3) Evaluation Measures
1. Common Neighbors
As a predictor, the common neighbors count captures the intuition that two strangers who share a mutual friend may be introduced by that friend, closing a triangle in the graph; all three measures listed here are illustrated with a short code sketch below.
2. Adamic Adar (AA)
3. Preferential Attachment
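To make the three measures concrete, below is a minimal Python sketch (plain sets and math, not the Neo4j implementation, so the logarithm base and edge cases may differ from the library's internals) that computes each score on the small friendship graph used in the Neo4j examples of the next section:

import math

# The toy friendship graph from the next section, as an adjacency map
friends = {
    "A": {"B", "C"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
    "E": {"B"},
}

def common_neighbors(g, u, v):
    # Number of nodes adjacent to both u and v
    return len(g[u] & g[v])

def adamic_adar(g, u, v):
    # Shared neighbors weighted by 1/log(degree): rarer mutual friends count more
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v])

def preferential_attachment(g, u, v):
    # Product of the two nodes' degrees
    return len(g[u]) * len(g[v])

print(common_neighbors(friends, "A", "D"))         # 2 (B and C)
print(round(adamic_adar(friends, "A", "D"), 3))    # 1/ln(4) + 1/ln(3) ≈ 1.632
print(preferential_attachment(friends, "A", "D"))  # 2 * 2 = 4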
(4) Link Prediction with the Neo4j Graph Algorithms Library
UNWIND [["A", "C"], ["A", "B"], ["B", "D"],
["B", "C"], ["B", "E"], ["C", "D"]] AS pair
MERGE (n1:Node {name: pair[0]})
MERGE (n2:Node {name: pair[1]})
MERGE (n1)-[:FRIENDS]-(n2)
neo4j> MATCH (a:Node {name: 'A'})
MATCH (d:Node {name: 'D'})
RETURN algo.linkprediction.commonNeighbors(a, d);
+-------------------------------------------+
| algo.linkprediction.commonNeighbors(a, d) |
+-------------------------------------------+
| 2.0 |
+-------------------------------------------+
1 row available after 97 ms, consumed after another 15 ms
neo4j> MATCH (a:Node {name: 'A'})
MATCH (e:Node {name: 'E'})
RETURN algo.linkprediction.commonNeighbors(a, e);
+-------------------------------------------+
| algo.linkprediction.commonNeighbors(a, e) |
+-------------------------------------------+
| 1.0 |
+-------------------------------------------+
neo4j> WITH {direction: "BOTH", relationshipQuery: "FRIENDS"}
AS config
MATCH (a:Node {name: 'A'})
MATCH (e:Node {name: 'E'})
RETURN algo.linkprediction.commonNeighbors(a, e, config)
AS score;
+-------+
| score |
+-------+
| 1.0 |
+-------+
neo4j> MATCH (a:Node {name: 'A'})
MATCH (d:Node {name: 'D'})
RETURN algo.linkprediction.preferentialAttachment(a, d)
AS score;
+-------+
| score |
+-------+
| 4.0 |
+-------+
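These values follow directly from the sample graph created above: A and D share the two common neighbors B and C (score 2.0), A and E share only B (score 1.0), and A and D each have two neighbors, so preferential attachment returns 2 × 2 = 4.0.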
(5) What Can We Do with the Link Prediction Scores?
1. Use the measures directly (a short ranking sketch follows below)
2. Supervised learning
Both approaches are described in detail in the reference article "Link Prediction In Large-Scale Networks":
https://hackernoon.com/link-prediction-in-large-scale-networks-f836fcb05c88?gi=b86a42e1c8d4
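For the first approach, a minimal sketch (with pair scores copied from the toy graph above, purely for illustration) is to rank candidate pairs by a single measure and treat the highest-scoring ones as predicted links:

# Common-neighbor scores for a few candidate pairs (illustrative values)
scores = {("A", "D"): 2.0, ("A", "E"): 1.0, ("C", "E"): 1.0, ("D", "E"): 1.0}

# Predict the top-k highest-scoring pairs as future links
top_k = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_k)  # [(('A', 'D'), 2.0), (('A', 'E'), 1.0)]

The second approach, training a classifier on top of these scores, is what the rest of this tutorial walks through.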
Building a Machine Learning Classifier
(1) The Machine Learning Model
(2) Training and Test Sets
Unconnected node pairs serve as negative examples, and they vastly outnumber the positive ones:
# negative examples = (# nodes)² - (# relationships) - (# nodes)
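To get a feel for the imbalance this formula implies, here is a quick back-of-the-envelope calculation; the node and relationship counts below are made-up placeholders, not the actual sizes of the citation dataset used later:

# Hypothetical graph size, for illustration only
nodes = 80_000
relationships = 200_000

# All ordered node pairs, minus existing links, minus self-pairs
negative_examples = nodes ** 2 - relationships - nodes
print(f"{negative_examples:,}")  # 6,399,720,000 candidate negative examples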
(3) Code Tutorial: Link Prediction in Practice
1. Importing the citation dataset
// Create constraints
CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;
// Import data from JSON files using the APOC library
CALL apoc.periodic.iterate(
'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
YIELD value WITH value
RETURN value',
'MERGE (a:Article {index:value.id})
SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
MERGE (v:Venue {name: venue})
MERGE (a)-[:VENUE]->(v)
FOREACH(author in authors |
MERGE (b:Author{name:author})
MERGE (a)-[:AUTHOR]->(b))
FOREACH(citation in citations |
MERGE (cited:Article {index:citation})
MERGE (a)-[:CITED]->(cited))',
{batchSize: 1000, iterateList: true});
2. Building the co-author graph
MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year,
count(*) AS collaborations
MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations;
3. Training and test datasets
The co-author graph is split by time rather than randomly: collaborations before 2006 form the training subgraph, and collaborations from 2006 onwards form the test subgraph.
Training subgraph:
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year < 2006
MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);
Test subgraph:
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year >= 2006
MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
As noted earlier, negative examples (# negative examples = (# nodes)² - (# relationships) - (# nodes)) vastly outnumber positive ones, so instead of using every unconnected pair we only take pairs of authors that are two or three hops apart and not directly connected:
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2
4. py2neo, pandas, scikit-learn
pip install py2neo==4.1.3 pandas scikit-learn
from py2neo import Graph
import pandas as pd
graph = Graph("bolt://localhost", auth=("neo4j", "neo4jPassword"))
5. Building our training and test sets
# Find positive examples
train_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
# Find negative examples
train_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
# Remove duplicates
train_missing_links = train_missing_links.drop_duplicates()
# Down sample negative examples
train_missing_links = train_missing_links.sample(
n=len(train_existing_links))
# Create DataFrame from positive and negative examples
training_df = pd.concat(
    [train_missing_links, train_existing_links], ignore_index=True)
training_df['label'] = training_df['label'].astype('category')
# Find positive examples
test_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
# Find negative examples
test_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_LATE]-()
MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
# Remove duplicates
test_missing_links = test_missing_links.drop_duplicates()
# Down sample negative examples
test_missing_links = test_missing_links.sample(n=len(test_existing_links))
# Create DataFrame from positive and negative examples
test_df = pd.concat(
    [test_missing_links, test_existing_links], ignore_index=True)
test_df['label'] = test_df['label'].astype('category')
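As a quick sanity check, both DataFrames should now contain an equal number of positive and negative examples after down-sampling; one way to confirm this:

# Each class (0 and 1) should appear the same number of times
print(training_df['label'].value_counts())
print(test_df['label'].value_counts())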
6. Choosing a machine learning algorithm
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=30, max_depth=10,
random_state=0)
7. Generating link prediction features
def apply_graphy_features(data, rel_type):
    # Compute common neighbors, preferential attachment, and total neighbors
    # for each candidate pair, restricted to the given relationship type
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "relType": rel_type}
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
test_df = apply_graphy_features(test_df, "CO_AUTHOR")
columns = ["cn", "pa", "tn"]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
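As an optional aside (not required for the evaluation step that follows), the fitted classifier can also return a probability per candidate pair via scikit-learn's predict_proba, which is handy when you want a ranked list of likely links rather than hard 0/1 predictions:

# Probability of the positive class ("a co-authorship will form") for the
# first few test pairs; test_df already has the cn/pa/tn feature columns
link_probabilities = classifier.predict_proba(test_df[columns])[:, 1]
print(link_probabilities[:5])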
8. Evaluating the model
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
def evaluate_model(predictions, actual):
    accuracy = accuracy_score(actual, predictions)
    precision = precision_score(actual, predictions)
    recall = recall_score(actual, predictions)
    metrics = ["accuracy", "precision", "recall"]
    values = [accuracy, precision, recall]
    return pd.DataFrame(data={'metric': metrics, 'value': values})

def feature_importance(columns, classifier):
    features = list(zip(columns, classifier.feature_importances_))
    sorted_features = sorted(features, key=lambda x: x[1] * -1)
    keys = [value[0] for value in sorted_features]
    values = [value[1] for value in sorted_features]
    return pd.DataFrame(data={'feature': keys, 'value': values})
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
evaluate_model(predictions, y_test)
feature_importance(columns, classifier)
9. Triangles and the clustering coefficient
CALL algo.triangleCount('Author', 'CO_AUTHOR_EARLY', {
write:true,
writeProperty:'trianglesTrain',
clusteringCoefficientProperty:'coefficientTrain'});
CALL algo.triangleCount('Author', 'CO_AUTHOR', {
write:true,
writeProperty:'trianglesTest',
clusteringCoefficientProperty:'coefficientTest'});
def apply_triangles_features(data, triangles_prop, coefficient_prop):
    # Look up the pre-computed triangle counts and clustering coefficients
    # for both nodes of each pair and keep the min and max of each
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
           apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
           apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
           apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs,
              "triangles": triangles_prop,
              "coefficient": coefficient_prop}
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])
training_df = apply_triangles_features(training_df,
"trianglesTrain", "coefficientTrain")
test_df = apply_triangles_features(test_df,
"trianglesTest", "coefficientTest")
columns = [
"cn", "pa", "tn",
"minTriangles", "maxTriangles", "minCoeff", "maxCoeff"
]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))
display(feature_importance(columns, classifier))
GitHub:
https://github.com/neo4j-contrib
Original articles:
https://medium.com/neo4j/link-prediction-with-neo4j-part-1-an-introduction-713aa779fd9
https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42w356b44c