[AI Security Papers] 05. RAID: Cyber Threat Intelligence Modeling Based on GCN
Original title: Cyber Threat Intelligence Modeling Based on Heterogeneous Graph Convolutional Network
Authors: Jun Zhao, Qiben Yan, Xudong Liu, Bo Li, Guangsheng Zuo
Paper link: https://www.usenix.org/system/files/raid20-zhao.pdf
Venue: RAID 2020 (CCF B)
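This post also draws on an article from the WeChat official account “安全学术圈”, which I recommend following; it is excellent. In addition, WeChat limits titles to 64 characters, so the title is abbreviated.
The 《娜璋带你读论文》 ("Nazhang Reads Papers with You") series is mainly a way to push myself to read excellent papers and attend academic talks, and to share them with everyone. I hope you enjoy it. My English and academic ability are limited and need continuous improvement, so corrections and comments are very welcome. I look forward to walking the academic road with you. Keep going!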
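Contents:
Abstract
Ⅰ. Introduction
Ⅱ. Background
1. Motivation
2. Preliminaries
Ⅲ. Overall Architecture of HINTI
Ⅳ. Methodology
1. Multi-granular Attention Based IOC Extraction
2. Cyber Threat Intelligence Modeling
3. Threat Intelligence Computing
Ⅴ. Datasets and Experimental Results
Ⅵ. Applications of Threat Intelligence Computing
Ⅶ. Conclusion and Personal Reflections
1. Conclusion
2. Personal Reflections
Ⅷ. Ten Well-Written English Sentences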
Previous posts in this series:
[AI Security Papers] 05. RAID: Cyber Threat Intelligence Modeling Based on GCN
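Abstract
IOC extraction accuracy is low.
Isolated IOCs can hardly describe the full picture of a threat event.
The interdependencies among heterogeneous IOCs remain unexplored, so they cannot be used to mine deeper security knowledge.
Proposes an IOC recognition method based on a multi-granular attention mechanism, which automatically extracts cyber threat objects from unstructured threat descriptions with improved accuracy.
Builds a heterogeneous information network (HIN) to model the dependencies among IOCs.
Proposes a threat intelligence computing framework based on Graph Convolutional Networks (GCN) for knowledge discovery.
Implements a prototype cyber threat intelligence (CTI) system.
Ⅰ. Introduction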
http://cve.mitre.org/
https://www.exploit-db.com/
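First, the accuracy of IOC extraction is low, which inevitably leads to critical threat objects being missed.
Second, isolated IOCs do not describe the full landscape of a threat event, so CTI users cannot gain a complete picture of an incoming threat.
Finally, there is no effective computing framework for measuring the interactions among heterogeneous IOCs.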
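Ⅱ. Background
1. Motivation
(i) First, security-related posts are annotated with the B-I-O sequence labeling scheme to build the IOC extraction model.
Here, B-X indicates that an element of type X is at the beginning of a fragment, I-X indicates that an element of type X is inside a fragment, and O marks other, non-essential elements. In the study, the authors annotated 30,000 such training samples from 5,000 threat description texts, which serve as the raw material for building the IOC extraction model.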
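To make the B-I-O scheme concrete, here is a minimal Python sketch of how a tokenized threat description could be tagged. The sentence, the entity names, the placeholder CVE identifier, and the `bio_tags` helper are all hypothetical illustrations and are not taken from the paper's dataset or code.

```python
def bio_tags(tokens, spans):
    """Assign B-X / I-X / O tags to a token list.

    spans: list of (start, end, label) token ranges, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence; "Shadow Crew" and the CVE ID are made up.
tokens = ["Shadow", "Crew", "exploited", "CVE-2099-0001", "on", "Linux", "routers"]
spans = [
    (0, 2, "Attacker"),
    (3, 4, "Vulnerability"),
    (5, 6, "Platform"),
    (6, 7, "Device"),
]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each (token, tag) pair in this form is what a sequence labeling model would be trained on; multi-token entities get one B- tag followed by I- tags.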
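(iv) Finally, HINTI integrates a CTI computing framework based on a heterogeneous graph convolutional network (see Section 4.3) to effectively quantify the relationships among IOCs and support knowledge discovery.
In particular, the proposed CTI computing framework characterizes IOCs and their relationships in a low-dimensional embedding space. On top of these embeddings, CTI users can apply any classification algorithm (e.g., SVM, Naive Bayes) or clustering algorithm (e.g., K-Means, DBSCAN) to gain new threat insights, such as predicting which attackers are likely to intrude into their systems, or identifying which vulnerabilities belong to the same category without expert knowledge.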
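As a rough illustration of the "embed first, then cluster" idea, the sketch below runs a tiny K-Means over made-up 2-D vulnerability embeddings. The embedding values, the names `vuln_a` to `vuln_d`, and the deterministic initialization are assumptions for the example; in the paper the embeddings come from the GCN-based framework, and any off-the-shelf clustering library could replace this hand-written loop.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means; the first k points are used as initial centroids
    (a simplification chosen here so the example is reproducible)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical low-dimensional embeddings of four vulnerabilities.
embeddings = {
    "vuln_a": (0.10, 0.20),
    "vuln_b": (0.15, 0.25),
    "vuln_c": (0.90, 0.80),
    "vuln_d": (0.85, 0.90),
}
names = list(embeddings)
labels, centroids = kmeans([embeddings[n] for n in names], k=2)

for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Vulnerabilities that land in the same cluster are candidates for "the same category", which is the kind of insight the paper says can be obtained without expert knowledge.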
2. Preliminaries
Definition 1 Heterogeneous Information Network of Threat Intelligence (HINTI)
Definition 2 Network Schema
Definition 3 Meta-path
attacker (A)
vulnerability (V)
device (D)
platform (P)
malicious file (F)
attack type (T)
Ⅲ. Overall Architecture of HINTI
Ⅳ. Methodology
1. Multi-granular Attention Based IOC Extraction
2. Cyber Threat Intelligence Modeling
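R1: attacker-exploit-vulnerability
An attacker exploits a vulnerability.
R2: attacker-invade-device
An attacker invades a device.
R3: attacker-cooperate-attacker
Attackers cooperate with each other.
R4: vulnerability-affect-device
A vulnerability affects a device.
R5: vulnerability-belong-attack type
A vulnerability belongs to an attack type.
R6: vulnerability-include-file
A vulnerability includes a malicious file.
R7: file-target-device
A malicious file targets a device.
R8: vulnerability-evolve-vulnerability
One vulnerability evolves from another vulnerability.
R9: device-belong-platform
A device belongs to a platform.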
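The nine relation types above can be read as a network schema: a set of allowed (source type, relation, target type) triples. The following Python sketch stores a small heterogeneous graph under that schema and follows a meta-path such as attacker-exploit-vulnerability-affect-device. The `ThreatHIN` class, its method names, and the node names are hypothetical and are not the paper's implementation.

```python
from collections import defaultdict

# Network schema derived from relations R1-R9.
SCHEMA = {
    ("attacker", "exploit", "vulnerability"),       # R1
    ("attacker", "invade", "device"),               # R2
    ("attacker", "cooperate", "attacker"),          # R3
    ("vulnerability", "affect", "device"),          # R4
    ("vulnerability", "belong", "attack_type"),     # R5
    ("vulnerability", "include", "file"),           # R6
    ("file", "target", "device"),                   # R7
    ("vulnerability", "evolve", "vulnerability"),   # R8
    ("device", "belong", "platform"),               # R9
}


class ThreatHIN:
    """Toy heterogeneous information network with schema checking."""

    def __init__(self):
        self.node_type = {}                # node name -> node type
        self.edges = defaultdict(set)      # (source, relation) -> targets

    def add_node(self, name, ntype):
        self.node_type[name] = ntype

    def add_edge(self, src, relation, dst):
        triple = (self.node_type[src], relation, self.node_type[dst])
        if triple not in SCHEMA:
            raise ValueError(f"relation not allowed by schema: {triple}")
        self.edges[(src, relation)].add(dst)

    def follow(self, start, relations):
        """Return the nodes reached from `start` along a relation sequence."""
        frontier = {start}
        for relation in relations:
            frontier = {
                dst
                for node in frontier
                for dst in self.edges.get((node, relation), set())
            }
        return frontier


# Hypothetical nodes and edges.
hin = ThreatHIN()
hin.add_node("attacker_1", "attacker")
hin.add_node("vuln_1", "vulnerability")
hin.add_node("device_1", "device")
hin.add_node("device_2", "device")
hin.add_edge("attacker_1", "exploit", "vuln_1")
hin.add_edge("vuln_1", "affect", "device_1")
hin.add_edge("vuln_1", "affect", "device_2")

reachable = hin.follow("attacker_1", ["exploit", "affect"])
print(sorted(reachable))
```

Checking each edge against the schema keeps the graph consistent with R1 to R9, and `follow` gives the meta-path instances that later feed the similarity computation.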
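3. Threat Intelligence Computing
Given a threat intelligence graph G = (V, E) and a meta-path set M = {P1, P2, …, Pi}:
i) Compute the similarity between IOCs based on meta-path Pi to generate the corresponding adjacency matrix Ai.
ii) Construct the node feature matrix Xi by embedding the attribute information of the IOCs into a vector space.
iii) Apply graph convolution GCN(Ai, Xi), which quantifies the interdependencies among IOCs along meta-path Pi and embeds them into a low-dimensional space.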
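The three steps can be sketched in plain Python as follows. Two simplifications are assumed here and should not be read as the paper's exact method: the meta-path similarity is approximated by counting shared vulnerabilities along attacker-vulnerability-attacker, and the graph convolution uses the widely known propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W) with fixed, untrained weights. All matrices and values are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Step i) Meta-path attacker-vulnerability-attacker: rows are 3 attackers,
# columns are 2 vulnerabilities; entry 1 means "attacker exploits vulnerability".
attacker_vuln = [[1, 0], [1, 1], [0, 1]]
path_counts = matmul(attacker_vuln, transpose(attacker_vuln))
# Drop the diagonal so the adjacency only links distinct attackers.
adjacency = [
    [0 if i == j else path_counts[i][j] for j in range(3)] for i in range(3)
]

# Step ii) Hypothetical attribute features, one row per attacker.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Step iii) One graph convolution with fixed (untrained) weights.
weights = [[0.5, -0.5], [0.5, 0.5]]
embedding = gcn_layer(adjacency, features, weights)

for row in embedding:
    print([round(v, 3) for v in row])
```

In a trained model the weight matrix would be learned and one such convolution would be run per meta-path; this sketch only shows how Ai and Xi fit together.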
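Ⅴ. Datasets and Experimental Results
(1) Stanford NER and NLTK NER are generally trained on news corpora, whereas this paper trains its model on a purpose-collected security corpus.
(2) Unlike rule-based extraction methods (such as iACE and Stucco), the deep-learning-based method proposed in the paper provides a better-performing end-to-end system for representing various IOCs.
(3) Compared with RNN-based methods (such as BiLSTM and BiLSTM+CRF), the paper's method introduces multi-granular embeddings (character-level, 1-gram, 2-gram, and 3-gram) to learn features of IOCs of different sizes and types simultaneously, so it can recognize more complex and irregular IOCs.
(4) The method uses an attention mechanism to learn the weights of features at different scales, which characterizes the different feature types effectively and further improves IOC recognition accuracy.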
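Points (3) and (4) can be illustrated with a small sketch: extract features at several granularities, then fuse them with softmax attention weights. The token list, the per-granularity vectors, and the attention scores below are invented for illustration; in the paper's model these would be learned embeddings and learned scores, so this is only a sketch of the weighting step.

```python
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def softmax(scores):
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, scores):
    """Fuse per-granularity vectors with softmax attention weights."""
    weights = softmax(scores)
    dim = len(vectors[0])
    fused = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return weights, fused


tokens = ["remote", "code", "execution", "vulnerability"]
granularities = {
    "char": list(tokens[0]),        # character-level view of the first token
    "1-gram": ngrams(tokens, 1),
    "2-gram": ngrams(tokens, 2),
    "3-gram": ngrams(tokens, 3),
}

# Hypothetical feature vector and attention score for each granularity.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
scores = [0.1, 0.4, 1.2, 0.3]
weights, fused = attend(vectors, scores)

for name, weight in zip(granularities, weights):
    print(f"{name}: weight {weight:.3f}")
print("fused:", [round(v, 3) for v in fused])
```

The granularity with the highest score dominates the fused representation, which is how attention lets the model favor, say, 2-gram features for a multi-word threat object.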
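Ⅵ. Applications of Threat Intelligence Computing
CTI threat profiling and ranking
Attack preference modeling
Vulnerability similarity analysis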
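For the vulnerability similarity application, one simple way to compare learned embeddings is cosine similarity. The sketch below ranks the nearest neighbors of a query vulnerability; the embedding values, the names, and the choice of cosine similarity are assumptions for illustration and are not stated in the text above as the paper's measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, top_k=2):
    """Rank every other item by similarity to the query item."""
    scored = [
        (name, cosine(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical vulnerability embeddings.
embeddings = {
    "vuln_a": [1.0, 0.0, 0.2],
    "vuln_b": [0.9, 0.1, 0.3],
    "vuln_c": [0.0, 1.0, 0.1],
}
ranked = most_similar("vuln_a", embeddings)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The same ranking idea extends to the other two applications: scoring IOCs for significance or comparing attackers by the embeddings of what they target.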
Data Availability
Model Extensibility
High-level Semantic Relations
Security Knowledge Reasoning
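Ⅶ. Conclusion and Personal Reflections
1. Conclusion
2. Personal Reflections
This paper is very close to my own ideas and experiments on automated threat intelligence extraction (implemented with NER), but my approach is less systematic than the paper's. The algorithmic innovation and the applications that follow, as well as the introduction and motivation, are all well worth learning from. I sincerely thank the Beihang University researchers for sharing this work. I learned a great deal, and it further confirms that my idea has value. Although our topics collide, I gained more than I lost, and I will keep refining my own experiments and ideas. Keep going!
I have run many experiments with BiLSTM and CNN+Attention before, and now I see what a multi-granular attention mechanism looks like: it combines character-level and n-gram features. It resembles the multi-view fusion algorithm I worked on in 2016, where entity alignment was optimized from two views, text and infobox.
Combining NLP with security to enrich semantics, and combining graph neural networks or GANs with binary analysis, are both promising directions with plenty left to do. In NLP terms this paper addresses the familiar named entity recognition (NER) problem, and its model still has a lot of room for optimization, but the approach is still fairly new in the CTI field and has great practical value. Another paper from the Beihang group, which uses BERT for threat intelligence and intrusion detection, is also well worth studying.
As for myself, although I can read English papers on my own, English writing, listening, and speaking are my critical weaknesses and need continuous work. I have also read far too few English papers. Fortunately, I have now stopped updating my technical blog and thrown myself into studying papers and running experiments. Cherish these days of hard work! A PhD is not easy, so treasure the present.
Ⅷ. Ten Well-Written English Sentences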
Cyber Threat Intelligence (CTI), as a collection of threat information, has been widely used in industry to defend against prevalent cyber attacks.
In this paper, we propose a novel CTI framework, HINTI, to model the interdependent relationships among heterogeneous IOCs to quantify their relevance.
Nowadays, we are witnessing a rapid growth of sophisticated cyber attacks (e.g., zero-day attack, advanced persistent threat). Such attacks can effortlessly bypass traditional defenses such as firewalls and intrusion detection systems (IDS), breach critical infrastructures, and cause devastating catastrophes. To combat these emerging threats, security experts proposed Cyber Threat Intelligence (CTI) that consists of a collection of Indicators of Compromise (IOCs).
Recent studies have proposed automated methods to extract CTI in the form of Indicator of Compromise (IOC) from unstructured security-related texts [4, 22]. Most of existing IOC extraction methods, such as CleanMX, PhishTank, IOC Finder, and Gartner peer insight, follow the OpenIOC [10] standard and extract particular types of IOCs (e.g., malicious IP, malware, file Hash, etc) by leveraging a set of regular expressions.
However, such extraction approaches face three major limitations. First, the accuracy of IOC extraction is low, which inevitably leads to the omission of critical threat objects [22]. Second, isolated IOC hardly depicts the comprehensive landscape of threat events, making it virtually impossible for CTI subscribers to gain a complete picture into the incoming threat. Third, there is a lack of an effective computing framework to efficiently measure the interactive relationships among heterogeneous IOCs.
To combat these limitations, HINTI, a cyber threat intelligence framework based on heterogeneous information network (HIN), is proposed to model and analyze CTIs.
Different from the existing CTI frameworks, HINTI aims to implement a computational CTI framework, which can not only extract IOCs efficiently but also model and quantify the relationships between them. Here, we use the motivating example to illustrate how HINTI works step-by-step in practice as follows.
Compared with Figure 1, it is obvious that HINTI can depict a more intuitive and comprehensive threat landscape than the previous approaches.
Particularly, our proposed CTI computing framework characterizes IOCs and their relationships in a low-dimensional embedding space, based on which CTI subscribers can use any classification (e.g., SVM, Naive Bayes) or clustering algorithms (K-Means, DBSCAN) to gain new threat insights, such as predicting which attackers are likely to intrude their systems, and identifying which vulnerabilities belong to the same category without the expert knowledge. In this work, we mainly explore three real-world applications to verify the effectiveness and efficiency of the CTI computing framework: IOC significance ranking (see Section 6.1), attack preference modeling (see Section 6.2), and vulnerability similarity analysis (see Section 6.3).
Recently, Bidirectional Long Short Term Memory+Conditional Random Fields (BiLSTM+CRF) model [15] has demonstrated excellent performance in text chunking and Named-entity Recognition (NER). However, directly applying this model to IOC extraction is unlikely to succeed, since threat texts usually contain a large number of threat objects with different grams and irregular structures. Consequently, we need an efficient method to learn the discriminative characteristics of IOCs with different sizes. In this paper, we propose a multi-granular attention based IOC extraction method, which can extract threat objects with different granularity.
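“娜璋AI安全之家” mainly covers Python big-data analysis, cyberspace security, artificial intelligence, and web penetration and attack-defense techniques, and also shares algorithm implementations from CCF, SCI, and Chinese core-journal (南核/北核) papers. The account will be more systematic and will rework all of the author's earlier articles, teaching Python and security from scratch. After nearly ten years of writing, I sincerely want to share what I have learned, felt, and done. Your feedback is welcome, and I warmly invite you to follow. Thank you.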