Processors for AI : a List

查看原文

其他

Processors for AI : a List

原创 2017-08-09 唐杉 StarryHeavensAbove

我在Github上建了一个Deep Learning相关的芯片和IP的列表（主要是业界的产品或项目），会定期更新。由于公众号文章不支持外部链接，大家请点击文末阅读原文查看完整的内容。

Nvidia

GPU

Nvidia's latest GPU can do 15 TFlops of SP or 120 TFlops with its new Tensor core architecture which is a FP16 multiply and FP32 accumulate or add to suit ML.
Nvidia is packing up 8 boards into their DGX-1for 960 Tensor TFlops.
Nvidia Volta - 架构看点 gives some insights of Volta architecture.

SoC

On edge, Nvidia provide NVIDIA DRIVE™ PX, The AI Car Computer for Autonomous Driving and JETSON TX1/TX2 MODULE "The embedded platform for autonomous everything"

Open Source DLA from Nvidia

Nvidia anouced "XAVIER DLA NOW OPEN SOURCE" on GTC2017. We did not see Early Access verion yet. Hopefully, the general release will be avaliable on Sep. as promised. For more analysis, you may want to read 从Nvidia开源深度学习加速器说起.

AMD

GPU

The soon to be released AMD Radeon Instinct MI25 is promising 12.3 TFlops of SP or 24.6 TFlops of FP16. If your calculations are amenable to Nvidia's Tensors, then AMD can't compete. Nvidia also does twice the bandwidth with 900GB/s versus AMD's 484 GB/s.

Intel

Nervana

Intel purchased Nervana Systems who was developing both a GPU/software approach in addition to their Nervana Engine ASIC. Comparable performance is unclear. Intel is also planning in integrating into the Phi platform via a Knights Crest project. NextPlatform suggested the 2017 target on 28nm may be 55 TOPS/s for some width of OP. There is a NervanaCon Intel has scheduled for December, so perhaps we'll see the first fruits then.

FPGA

Intel FPGA OpenCL and Solutions.

Google TPU

Google's original TPU had a big lead over GPUs and helped power DeepMind's AlphaGo victory over Lee Sedol in a Go tournament. The original 700MHz TPU is described as having 95 TFlops for 8-bit calculations or 23 TFlops for 16-bit whilst drawing only 40W. This was much faster than GPUs on release but is now slower than Nvidia's V100, but not on a per W basis. The new TPU2 is referred to as a TPU device with four chips and can do around 180 TFlops. Each chip's performance has been doubled to 45 TFlops for 16-bits. You can see the gap to Nvidia's V100 is closing. You can't buy a TPU or TPU2. Google is making them available for use in their cloud with TPU pods containing 64 devices for up to 11.5 PetaFlops of performance. The giant heatsinks on the TPU2 are some cause for speculation, but the market is changing from devices to units with groups of devices and also such groups within the cloud.

Other references are:

Xilinx

Xilinx provide "Machine Learning Inference Solutions from Edge to Cloud" and naturally claim their FPGA's are best for INT8 with one of their white papers.

Whilst performance per Watt is impressive for FPGAs, the vendors' larger chips have long had earth shatteringly high chip prices for the larger chips. Xilinx's VU9P lists at over $US 50k at Avnet. Finding a balance between price and capability is the main challenge with the FPGAs.

Microsoft FPGA

Microsoft has thrown its hat into the FPGA ring, "Microsoft Goes All in for FPGAs to Build Out AI Cloud."
Wired did a nice story on the MSFT use of FPGAs too, "Microsoft Bets Its Future on a Reprogrammable Computer Chip"
Inside the Microsoft FPGA-based configurable cloud is also a good reference if want to know Microsoft's vision on FPGA in cloud.

Qualcomm

Qualcomm has been fussing around ML for a while with the Zeroth SDK and Snapdragon Neural Processing Engine. The NPE certainly works reasonably well on the Hexagon DSP that Qualcomm use. The Hexagon DSP is far from a very wide parallel platform and it has been confirmed by Yann LeCun that Qualcomm and Facebook are working together on a better way in Wired's "The Race To Build An AI Chip For Everything Just Got Real", "And more recently, Qualcomm has started building chips specifically for executing neural networks, according to LeCun, who is familiar with Qualcomm's plans because Facebook is helping the chip maker develop technologies related to machine learning. Qualcomm vice president of technology Jeff Gehlhaar confirms the project. "We're very far along in our prototyping and development," he says." Perhaps we'll see something soon beyond the Kryo CPU, Adreno GPU, Hexagon DSP, and Hexagon Vector Extensions. It is going to be hard to be a start-up in this space if you're competing against Qualcomm's machine learning.

Apple

Will it or won't it? Bloomberg is reporting it will as a secondary processor but there is little detail. Not only is it an important area for Apple, but it helps avoid and compete with Qualcomm.

ARM

DynamIQ is embedded IP giant's answer to AI age. It may not be a revolutionary design but is important for sure.

IBM TrueNorth

TrueNorth is IBM's Neuromorphic CMOS ASIC developed in conjunction with the DARPA SyNAPSE program.

It is a manycore processor network on a chip design, with 4096 cores, each one simulating 256 programmable silicon "neurons" for a total of just over a million neurons. In turn, each neuron has 256 programmable "synapses" that convey the signals between them. Hence, the total number of programmable synapses is just over 268 million (228). In terms of basic building blocks, its transistor count is 5.4 billion. Since memory, computation, and communication are handled in each of the 4096 neurosynaptic cores, TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors. Wikipedia

HiSilicon(华为海思）

Yu Chengdong, Huawei CEO, recently announced at the 2017 China Internet Conference that Huawei is developing an AI processor.

No more details yet.

Cambricon（寒武纪）

Cambricon is working on IP License, Chip Service, Smart Card and Intelligent Platform.

Horizon Robotics（地平线机器人）

Horizon Robotics has a Brain Processing Unit (BPU) in the works.

Deephi（深鉴科技）

DeePhi Tech has the cutting-edge technologies in deep compression, compiling toolchain, deep learning processing unit (DPU) design, FPGA development, and system-level optimization.

Bitmain（比特大陆）

Bitcoin Mining Giant Bitmain is developing processors for AI.

Wave Computing

Wave’s Compute Appliance is capable to run TensorFlow at 2.9 PetaOPS/sec on their 3RU appliance. Wave refers to their processors at DPUs and an appliance has 16 DPUs. Wave uses processing elements it calls Coarse Grained Reconfigurable Arrays (CGRAs). It is unclear what bit width the 2.9 PetaOPS/s is referring to. From their white paper, the ALUs can do 1b, 8b, 16b and 32b,

"The arithmetic units are partitioned. They can perform 8-b operations in parallel (ideal for DNN inferencing) as well as 16-b and 32-b operations (or any combination of the above). Some 64-b operations are also available and these can be extended to arbitrary precision using software."

Here is a bit more on one of the 16 DPUs included in the appliance,

"The Wave Computing DPU is an SoC that contains a 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four Hybrid Memory Cube (HMC) Gen 2 interfaces, two DDR4 interfaces, a PCIe Gen3 16-lane interface and an embedded 32-b RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."

On TensorFlow ops,

"The Wave DNN Library team creates pre-compiled, relocatable kernels for common DNN functions used by workflows like TensorFlow. These can be assembled into Agents and instantiated into the machine to form a large data flow graph of tensors and DNN kernels." "...a session manager that interfaces with machine learning workflows like TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inferencing. These workflows provide data flow graphs of tensors to worker processes. At runtime, the Wave session manager analyzes data flow graphs and places the software agents into DPU chips and connects them together to form the data flow graphs. The software agents are assigned regions of global memory for input buffers and local storage. The static nature of the CGRA kernels and distributed memory architecture enables a performance model to accurately estimate agent latency. The session manager uses the performance model to insert FIFO buffers between the agents to facilitate the overlap of communication and computation in the DPUs. The variable agents support software pipelining of data flowing through the graph to further increase the concurrency and performance. The session manager monitors the performance of the data flow graph at runtime (by monitoring stalls, buffer underflow and/or overflow) and dynamically tunes the sizes of the FIFO buffers to maximize throughput. A distributed runtime management system in DPU-attached processors mounts and unmounts sections of the data flow graph at run time to balance computation and memory usage. This type of runtime reconfiguration of a data flow graph in a data flow computer is the first of its kind."

Some more details can be fund in this article: AI芯片浅析Yann LeCun提到的两款Dataflow Chip

Graphcore

Graphcore raised $30M of Series-A late last year to support the development of their Intelligence Processing Unit, or IPU. Resently, co-founder and Chief Technology Officer, Simon Knowles, was invited to give a talk at the 3rd Research and Applied AI Summit (RAAIS) in London, showing interesting ideas behind their processor.

解密又一个xPU：Graphcore的IPU give some analysis on its IPU architecture.

PEZY Computing K.K.

Pezy-SC and Pezy-SC2 are the 1024 core and 2048 core processors that Pezy develop. The Pezy-SC 1024 core chip powered the top 3 systems on the Green500 list of supercomputers back in 2015. The Pezy-SC2 is the follow up chip that is meant to be delivered by now, but details are scarce yet intriguing,

"PEZY-SC2 HPC Brick: 32 of PEZY-SC2 module card with 64GB DDR4 DIMM (2.1 PetaFLOPS (DP) in single tank with 6.4Tb/s" It will be interesting to see what 2,048 MIMD MIPS Warrior 64-bit cores can do. In the June 2017 Green500 list, a Nvidia P100 system took the number one spot and there is a Pezy-SC2 system at number 7. So the chip seems alive but details are thin on the ground. Motoaki Saito is certainly worth watching.

KnuEdge's KnuPath

Their product page has since June 2016 gone missing in action. Not sure what they are up to with the $100M they put into their MIMD architecture. It was described at the time as having 256 tiny DSP, or tDSP, cores on each ASIC along with an ARM controller suitable for sparse matrix processing in a 35W envelope. The performance is unknown, but they compared their chip to a current NVIDIA, at that time, and said they had 2.5 times the performance. We know Nvidia is now more than ten times faster with their Tensor cores so KnuEdge will have a tough job keeping up. A MIMD or DSP approach will have to execute awfully well to take some share in this space.

Tenstorrent

Tenstorrent is a small Canadian start-up in Toronto claiming an order of magnitude improvement in efficiency for deep learning, like most. No real public details but they're are on the Cognitive 300 list.

Cerebras

Cerebras is notable due to its backing from Benchmark and that its founder was the CEO of SeaMicro. It appears to have raised $25M and remains in stealth mode.

Thinci

Thinci is developing vision processors from Sacremento with employees in India too. They claim to be at the point of first silicon, Thinci-tc500, along with benchmarking and winning of customers already happening. Apart from "doing everything in parallel" we have little to go on.

Koniku

Koniku's web site is counting down to "your new reality". They have raised very little money and after watching their Youtube clip embedded in this Forbes page, you too will not likely not be convinced, but you never know. Harnessing biological cells is certainly different. It sounds like a science project, but, then this,

"We are a business. We are not a science project," Agabi, who is scheduled to speak at the Pioneers Festival in Vienna, next week, says, "There are demands that silicon cannot offer today, that we can offer with our systems." The core of the Koniku offer is the so-called neuron-shell, inside which the startup says it can control how neurons communicate with each other, combined with a patent-pending electrode which allows to read and write information inside the neurons. All this packed in a device as large as an iPad, which they hope to reduce to the size of a nickel by 2018.

Adapteva

Adapteva: "Adapteva tapes out Epiphany-V: A 1024-core 64-bit RISC processor." Andreas Olofsson taped out his 1024 core chip late last year and we await news of its performance. Epiphany-V has new instructions for deep learning and we'll have to see if this memory-controller-less design with 64MB of on-chip memory will have appropriate scalability. The impressive efficiency of Andrea's design and build may make this a chip we can all actually afford, so let's hope it performs well.

Knowm

Knowm talks about Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. Here is a paper covering the subject, "AHaH Computing–From Metastable Switches to Attractors to Machine Learning."

Mythic

A battery powered neural chip from Mythic with 50x lower power.

Kalray

Despite many promises,Kalray has not progressed their chip offering beyond the 256 core beast covered back in 2015, "Kalray - new product meander." Kalray is advertising their product as suitable for embedded self-driving car applications. Kalray has a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs with up to 1 TFlop/s on chip. Kalrays NN fortunes may improve with an imminent product refresh and just this month Kalray completed a new funding that raised $26M. The new Coolidge processor is due in mid-2018 with 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning.

Brainchip

Brainchip's Spiking Neuron Adaptive Processor (SNAP) will not do deep learning and is a curiosity without being a practical drop in CNN engineering solution, yet. IBM's stochastic phase-change neurons seem more interesting if that is a path you wish to tread.

Groq

Groq is founded by Ex-googlers, who designed Google TPU.

Aimotive

This BDTi artical shows some information of aiWare IP of Aimotive .

Speaking of chips, AImotive and partner VeriSilicon are in the process of designing a 22 nm FD-SOI test chip, which is forecast to come out of GlobalFoundries' fab in Q1 2018 (Figure 4). It will feature a 1 TMAC/sec aiWare core, consuming approximately 25 mm2 of silicon area; a Vivante VIP8000-derivative processor core will inhabit the other half of the die, and between 2-4 GBytes of DDR4 SDRAM will also be included in the multi-die package. The convolution-tailored LAM in this test chip, according to Feher, will have the following specifications (based on preliminary synthesis results): 2,048 8x8 MACs Logic area (including input/output buffering logic, LAM control and MACs): 3.45mm2 Memory (on-chip buffer): in the range of 5-25mm2 depending on configuration (10-50 Mbits).

Deep Vision

Deep Vision is bulding low-power chips for deep learning. Perhaps one of these papers by the founders have clues, "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013] and "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing" [2015].

Deep Scale

DeepScale raises $3 million for perception AI to make self-driving cars safe

REM

Reduced Energy Microsystems are developing lower power asynchronous chips to suit CNN inference. REM was Y Combinator's first ASIC venture according to TechCrunch.

Leepmind

We are carrying out research on original chip architectures in order to implement Neural Networks on a circuit enabling low power DeepLearning

KAIST DNPU

Face Recognition System “K-Eye” Presented by KAIST
从ISSCC Deep Learning处理器论文到人脸识别产品

Synopsys Embedded Vision

DesignWare EV6x Embedded Vision Processors
处理器IP厂商的机器学习方案 - Synopsys

CEVA XM6

CEVA-XM6 Fifth-generation computer vision and deep learning embedded platform
处理器IP厂商的机器学习方案 - CEVA

VeriSilicon VIP8000

VeriSilicon’s Vivante VIP8000 Neural Network Processor IP Delivers Over 3 Tera MACs Per Second
神经网络DSP核的一桌麻将终于凑齐了

Cadence P5/P6/C5

Tensilica Vision DSPs for Imaging, Computer Vision, and Neural Networks

Reference

FPGAs and AI processors: DNN and CNN for all

题图及插图均来自网络，版权归原作者所有

观察｜官方通报陕西蒲城一职校学生坠亡：事发前与舍友发生口角和肢体冲突 认定该生系高空坠落死亡

桐城一派｜倒在“跨年夜”的龚书记，13个字换来免职调查冤不冤？

比佟丽娅还恋爱脑，怀孕7次流产4次，目睹丈夫背叛却选择原谅

市管干部“龚书记”免职迷局

讣告！又一知名女星在家中去世，终年54岁，曾是无数人白月光…

生成图片，分享到微信朋友圈