This article is reproduced from the public account "Intelligence Expert".

I set out to write an article predicting what the coming year holds for AI chips and how NVIDIA will respond to the challenge, but I quickly realized that it would be much longer than I expected. Since there is a lot to cover, I decided to divide the article into three parts.

Part 1: Introduction, and analysis of the big companies that want to challenge NVIDIA: Intel, AMD, Google, Xilinx, Apple, Qualcomm.

Part 2: Startups and Chinese companies, and the roles they might play.

Part 3: NVIDIA's strategy to defend against potential competitors.

1. Introduction

In the past five years, NVIDIA has grown its data center business into a multi-billion dollar giant without ever facing a decent competitor. This is an amazing fact, and as far as I can recall it is unparalleled in today's technology world. This rapid growth is driven by demand for fast GPU chips for artificial intelligence (AI) and high performance computing (HPC). Jensen Huang, CEO of NVIDIA, likes to talk about the "Cambrian explosion" in deep learning, referring to the rapid pace of innovation in neural network algorithms. We will discuss what this means for NVIDIA in Part 3, but I chose to borrow the concept as the title of this series. We are on the doorstep of an explosion of specialized AI chips from many large and small companies around the world. Three years ago, it was almost impossible for chip startups to get venture capital. Now, there are dozens of well-funded challengers building chips for artificial intelligence.

Figure 1: NVIDIA compares the explosive development of new neural networks to the Cambrian explosion.

Last year, NVIDIA and IBM reached the pinnacle of computing when they announced that they were powering the world's fastest supercomputer, the Summit supercomputer at the US Department of Energy's Oak Ridge National Laboratory (ORNL), with about 95% of its performance coming from NVIDIA's Volta GPUs. Although this is an incredible achievement, many people are beginning to wonder whether the fairy tale can last for NVIDIA.

Figure 2: The Summit Supercomputer at the US Department of Energy's Oak Ridge National Laboratory is the fastest computer in the world today.

According to the latest quarterly report, NVIDIA's data center revenue grew 58% to $792 million, nearly 25% of the company's total revenue. Over the past four quarters, the figure totaled $2.86 billion. If the company can sustain this growth, data center revenue could reach about $4.5 billion in 2019. It sounds like heaven, or at least heaven on earth, right?

There is no doubt that NVIDIA has created superior products driven by its vision of a strong and scalable architecture. NVIDIA now has a strong and self-sustaining ecosystem of software, universities, startups and partners that makes it the master of the new world it has created. While some would argue that this ecosystem forms an insurmountable moat, dark clouds are now appearing on the horizon. The potential threats come from Intel, Google, AMD, and dozens of US and Chinese startups, all drawn by the heat of artificial intelligence.

So far, in my opinion, the competition has amounted to little more than skirmishes. Competitors have made dozens of announcements, but I am quite convinced that no company other than Google has actually taken any revenue out of NVIDIA's coffers. Let's take a look at the current competitive landscape and see what 2019 may look like.

The big challengers

Although more than 40 startups have entered the field, let's be realistic: only a few companies can truly succeed in this market (say, with revenue over $1 billion). In training deep neural networks, NVIDIA is hard to beat, given its products, installed base and ubiquitous ecosystem. However, the inference market, which is relatively small today, will eventually exceed the training market in total revenue. Unlike training, inference is not a single market. It spans a large number of data types and associated optimized deep learning algorithms in the cloud and at the edge, each with specific performance, power and latency requirements. In addition, there is no dominant giant in the inference market, even in the automotive market where NVIDIA claims a leading position. For these reasons, inference is the primary or initial focus of most new entrants. Let's take a look at the big companies that are vying for a seat at the table.


Google

One of the first companies to demonstrate that a dedicated chip (called an ASIC) can take on the more programmable, more general-purpose GPU for deep learning was Google. Coincidentally, Google may also be one of NVIDIA's largest customers. As I have said before, Google has now released four "Tensor Processing Units" (TPUs), chips and boards that accelerate deep learning training and inference processing in the cloud and, more recently, at the edge. Google's TPU is quite solid for training and running deep neural networks, delivering up to 45 trillion operations per second (TOPS) per chip. In contrast, NVIDIA's Volta delivers up to 125 TOPS. Google's first two TPUs were really for internal use and bragging rights, but Google now offers them as a service to its cloud customers on Google Compute Cloud.

Although the TPU has undoubtedly contributed to Google's artificial intelligence initiatives, its market outside of Google's internal use cases (which is, of course, a fairly large market) is intentionally limited. TPUs can only be used to train and run models built with Google's TensorFlow AI framework; you can't use them to train or run AI built with Apache MXNet or PyTorch (the fast-growing AI framework supported by Facebook and Microsoft). They also cannot be used for non-AI HPC applications, where GPUs dominate. In addition, you cannot purchase TPUs for on-premises computing in enterprise or government data centers and servers. But Google doesn't mind any of this, because it believes TPU and TensorFlow are strategically important to its overall leadership in artificial intelligence. Software optimized for hardware and hardware optimized for software can build a powerful and long-lasting platform.

A more direct impact of the TPU may be that it validated the ASIC concept as an alternative to the GPU, at least for potential investors. The CEO of a deep learning chip startup shared this experience with me: after Google announced its TPU, venture capital began to flow freely. He subsequently raised hundreds of millions of dollars.

Google has been good at stealing some of the limelight from NVIDIA's predictable announcements at the GPU Technology Conference (usually in March), and I would not be surprised to see Google show up again this year, perhaps with a 7 nm TPU product with eye-catching performance figures.

Amazon Web Services is not far behind: the company announced last fall that it is also building a custom ASIC for inference processing. However, the chip is still under development, and the company has not shared any details about its design or availability.


Intel

Figure 3: Former Nervana CEO Naveen Rao leads Intel's AI product development and is open about the company's strategy.

This one gets a bit complicated, because Intel is a big company doing a lot of things at once. Although Intel intends to compete in AI training and inference with its Nervana chip at the end of 2019, it realizes that inference will become the bigger market, and it has a very strong hand there. In addition to the Xeon CPU (whose inference performance improved significantly in recent updates), Intel also acquired Mobileye and Movidius for automotive and embedded inference processing. I have seen demos of both devices, and they are really impressive. Intel has also invested in a run-anywhere software stack called OpenVINO, which allows developers to train anywhere and then optimize and run on any Intel processor, which is impressive.

At CES in Las Vegas, Intel revealed that it is working closely with Facebook on the inference version of the Nervana Neural Network Processor (NNP-I), which is surprising because many people had predicted that Facebook was developing its own inference accelerator.

At the same time, Naveen Rao, Intel vice president and general manager of artificial intelligence products, shared on Twitter that NNP-I will be an SoC (system on a chip), manufactured in Intel's 10 nm fab, and will include Ice Lake x86 cores. Rao said this will be a common theme for Intel going forward, possibly referring to future x86/GPU chips for desktops and laptops, similar to AMD's APUs.

In terms of training, Intel's original plan was to release the "Lake Crest" Nervana NNP in 2017, one year after acquiring Nervana. Then it slipped to 2018... Eventually, the company decided to start over. This was probably not because the first Nervana part was bad. Rather, Intel realized that the device's performance was not enough to significantly outperform NVIDIA and the TensorCores it added to Volta and subsequent GPUs. I think we will see the same play again when NVIDIA unveils whatever surprising new product it builds on the 7 nm process, but that is getting a bit ahead of the story.

Qualcomm and Apple

For the sake of completeness, I include both companies because they are both focused on delivering impressive artificial intelligence capabilities on mobile phones (and, in Qualcomm's case, IoT devices and self-driving cars). Apple, of course, focuses on the A-series CPUs in the iPhone and on iOS support for on-device AI. As mobile phones become the dominant platform for AI inference in voice and image processing, both companies have a great deal of IP they can use to establish leadership (Huawei is also pushing hard on artificial intelligence, as we will cover in Part 2).


AMD

For the past three years, AMD has been working hard to get its AI software house in order. When I worked there in 2015, you couldn't even run its GPUs on a Linux server without booting Windows. Since then, the company has made great strides: the ROCm software and compilers simplify migration from CUDA, and MIOpen (not to be confused with OpenML) accelerates the math libraries on its chips. However, AMD's GPUs are still at least one generation behind NVIDIA's V100 for AI, and the V100 is approaching two years old. How AMD will compete with NVIDIA's TensorCores on 7 nm remains to be seen.


Xilinx

No doubt, Xilinx, the leading supplier of programmable logic devices (FPGAs), had a very good 2018. In addition to announcing its next-generation 7 nm architecture, it scored significant design wins at companies such as Microsoft, Baidu, Amazon, Alibaba and Daimler-Benz. In artificial intelligence inference processing, FPGAs have a distinct advantage over ASICs because they can be dynamically reconfigured for the specific job at hand. This matters a great deal when the underlying technology is changing rapidly, as is the case with artificial intelligence. For example, Microsoft has shown how its FPGAs (now from Xilinx as well as Intel) can use 1-bit, 3-bit, or almost any precision math for specific layers in a deep neural network. This may sound nerdy, but it can greatly speed up processing and reduce latency while using less power. In addition, the upcoming Xilinx 7 nm chip, called Versal, features AI and DSP engines that accelerate application-specific processing alongside an adaptable logic array. Versal will start shipping sometime this year, and I think it could be a game changer for inference.
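To make the reduced-precision idea concrete, here is a minimal Python sketch of uniform quantization to an arbitrary bit width. It is an illustration only and not a description of Microsoft's or Xilinx's actual method; the function names `quantize` and `max_error`, the assumed value range of [-1, 1], and the sample weights are all hypothetical.

```python
def quantize(values, bits):
    """Uniformly quantize values (assumed to lie in [-1, 1]) to 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)           # spacing between representable values
    out = []
    for v in values:
        v = max(-1.0, min(1.0, v))      # clamp to the representable range
        idx = round((v + 1.0) / step)   # index of the nearest level
        out.append(-1.0 + idx * step)
    return out


def max_error(values, bits):
    """Largest absolute difference between a value and its quantized form."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


# Hypothetical layer weights, used only for illustration.
weights = [-0.9, -0.5, -0.1, 0.2, 0.6, 1.0]

for bits in (1, 3, 8):
    print(bits, "bits -> max error", max_error(weights, bits))
```

With 1 bit every weight collapses to -1 or +1, while 3 bits give 8 levels and an error of at most half a step (1/7 here). The trade-off in the text follows from this: fewer bits mean less data to move and simpler arithmetic, at the price of precision that some layers can tolerate and others cannot.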

2. Startups

This is the second of three articles on the state of the artificial intelligence chip market and what will happen in 2019. This year will be a feast of new chips and benchmark battles, led by the big companies I covered in the first part (Intel, Google, AMD, Xilinx, Apple, Qualcomm), along with dozens of Silicon Valley startups and Chinese unicorns valued at more than $1 billion. In this section, I will cover the best-known, or at least the most vocal, startups in the West and in China, where the government is working to build a domestic artificial intelligence chip industry. We'll start with Wave, which appears to be the first company to bring a chip for training to market.

Wave Computing

Wave Computing had an eventful 2018: it launched its first dataflow processing unit, acquired MIPS, created MIPS Open, and delivered its first early systems to a few lucky customers. Although the Wave architecture has some very interesting features worth a deeper look, we are still waiting for customer experience with large-scale real-world workloads.

Wave is not an accelerator attached to a server; it is a standalone processor for graph computing. This approach has advantages and disadvantages. On the plus side, Wave does not suffer from the memory bottlenecks found in accelerators such as GPUs. On the negative side, installing a Wave appliance is a forklift upgrade that completely replaces the traditional x86 server, making Wave a competitor to every server manufacturer.

I don't expect Wave to deliver better results than NVIDIA on a single node, but its architecture is well designed, and the company has said it should have customer results very soon. Stay tuned!

Figure 1: The system shipped by Wave is built from the 4-node "DPU" board shown above.


Graphcore

Graphcore is a well-funded British unicorn startup that has raised $310 million, is currently valued at $1.7 billion, and has a world-class team. It is building a novel graph processor architecture with memory on the same chip as its logic, which should deliver higher performance on real-world applications. The team has been teasing its upcoming products for a long time. Last April it was "almost ready to launch," and the company's latest information, from last December, indicates that it will start production soon. Its investor list is impressive, including Sequoia Capital, BMW, Microsoft, Bosch and Dell Technologies.

I've seen Graphcore's architecture, and it looks quite compelling, scaling from edge devices up to the "Colossus" two-chip package for data center training and inference. At the recent NeurIPS conference, Graphcore showed off its RackScale IPU Pod, which delivers more than 16 petaflops in a rack with 32 servers. Although Graphcore often claims that its performance will be 100 times better than best-in-class GPUs, my math says otherwise.

Graphcore says that a server with 4 "Colossus" GC2 cards (8 chips) delivers 500 TFlops (trillion floating-point operations per second) of mixed precision performance. A single NVIDIA V100 delivers 125 TFlops, so in theory 4 V100s should deliver the same performance. As always, the devil is in the details. The V100's peak performance is only available when the code is refactored to perform the 4×4 matrix multiplications that the TensorCores execute, a limitation the Graphcore architecture cleverly avoids. Not to mention that the V100 is expensive and consumes up to 300 watts of power. In addition, Graphcore supports on-chip interconnect and a "processor-in-memory" (on-chip memory) approach, which may deliver excellent application performance beyond what TFlops benchmarks suggest. In some neural networks, such as generative adversarial networks (GANs), memory is the bottleneck.
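To illustrate what refactoring code around TensorCores means, here is a rough pure-Python sketch of the underlying operation: on Volta, a TensorCore computes D = A×B + C on 4×4 tiles, with half-precision (FP16) inputs and a wider accumulator. This is a conceptual model only, not NVIDIA's implementation or API. The names `to_fp16`, `tile_mma` and `matmul_tiled` are hypothetical, and a Python float stands in for the FP32 accumulator.

```python
import struct

TILE = 4  # Volta TensorCores operate on 4x4 tiles


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def tile_mma(a, b, c):
    """Return D = A x B + C for 4x4 tiles: FP16 inputs, wider accumulation."""
    d = []
    for i in range(TILE):
        row = []
        for j in range(TILE):
            acc = c[i][j]  # accumulator (Python float standing in for FP32)
            for k in range(TILE):
                acc += to_fp16(a[i][k]) * to_fp16(b[k][j])
            row.append(acc)
        d.append(row)
    return d


def matmul_tiled(a, b):
    """Multiply two n x n matrices (n a multiple of 4) one 4x4 tile at a time."""
    n = len(a)
    assert n % TILE == 0, "matrix size must be a multiple of the tile size"
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for bk in range(0, n, TILE):
                a_tile = [r[bk:bk + TILE] for r in a[bi:bi + TILE]]
                b_tile = [r[bj:bj + TILE] for r in b[bk:bk + TILE]]
                acc = tile_mma(a_tile, b_tile, acc)
            for i in range(TILE):
                out[bi + i][bj:bj + TILE] = acc[i]
    return out


# FP16 inputs lose precision: 0.1 is not exactly representable.
print(to_fp16(0.1))
```

The point of the sketch is the constraint it exposes: the workload has to be expressed as dense multiplications of small tiles in reduced precision before the peak TFlops figure applies, which is why peak numbers alone say little about application performance.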

Again, we will have to wait for real users to evaluate this architecture with real application results. Still, Graphcore's investor list, roster of experts and sky-high valuation tell me that this could be something good.


Figure 2: Graphcore shows this very cool image of ImageNet dataset processing. The visualization helps developers understand which parts of their training process consume processing cycles.

Habana Labs

Last September, Israeli startup Habana Labs surprised many people when it announced at the first AI Hardware Summit that it was ready to ship its first chip for inference, with record performance running convolutional neural networks for image processing. The results showed the processor classifying 15,000 images per second on the ResNet-50 image classification benchmark, about 50% higher than NVIDIA's T4, while consuming only 100 watts. In December 2018, Habana Labs closed its latest funding round, led by Intel Capital and joined by WRV Capital, Bessemer Venture Partners and Battery Ventures, adding $75 million to the $45 million it had previously raised. Part of the new funding will go toward the second chip, called "Gaudi," which will target the training market and is said to scale to thousands of processors. In this highly competitive field, Habana Labs shows a lot of promise.

Other startups

I know of more than 40 companies around the world that are building chips for artificial intelligence training and inference. I find that most of them are doing simple FMA (floating-point multiply-accumulate) and mixed precision math (8-bit integers, 16-bit and 32-bit floating point), which does not surprise me. This approach is relatively easy to build and will pick some low-hanging fruit, but it does not provide a lasting architectural advantage over big companies like NVIDIA and Intel, or over the handful of startups developing cool architectures, such as Wave and Graphcore. Here are a few companies that caught my attention:

Groq: Founded by former Google employees who worked on the TPU, with ambitions to rule the world.

Tenstorrent: Founded in Canada by former AMD employees and still in stealth mode. I can only say that its CEO's vision and architecture left a deep impression on me.

ThinCI: An India-based company that specializes in edge devices and autonomous vehicles and has established partnerships with Samsung and Denso.

Cerebras: Led by former employees of SeaMicro (acquired by AMD), including Andrew Feldman, it is still in deep "stealth" mode.

Mythic: A startup taking a unique approach to edge inference processing, akin to analog processing in non-volatile memory; chips should be released in 2019.

Chinese companies

China has been trying to find a way to end its dependence on US semiconductors, and artificial intelligence accelerators may provide the exit ramp it has been looking for. The Chinese central government has set a goal of building a trillion-dollar artificial intelligence industry by 2030. Since 2012, investors have poured more than $4 billion into startups. The US Congress has called this an artificial intelligence arms race. The US technology industry may fall behind, as Chinese companies and research institutions are less encumbered by the privacy and ethical concerns that slow progress in the West.

Cambricon and SenseTime are probably the most noteworthy Chinese artificial intelligence companies, but companies like Horizon Robotics in the edge AI sector also deserve attention. In addition, keep a close eye on large companies such as Baidu, Huawei, Tencent and Alibaba, all of which have invested heavily in artificial intelligence software and hardware.

Cambricon is a Chinese unicorn valued at $2.5 billion that has released its third generation of artificial intelligence chips. The company claims approximately a 30% performance advantage over the NVIDIA V100 at lower power. Cambricon also sells IP to customers and provides the artificial intelligence hardware for the Huawei Kirin 970 mobile chipset.

SenseTime is perhaps the most highly valued artificial intelligence startup, best known for deploying smart surveillance cameras throughout China. There are more than 175 million of these cameras, including cameras made by other companies. SenseTime was founded in Hong Kong. Its latest funding round totaled $600 million, led by Alibaba. According to several media reports, the startup is currently valued at $4.5 billion. SenseTime has established strategic partnerships with major companies such as Alibaba, Qualcomm, Honda and even NVIDIA. The company now has a supercomputer running about 8,000 GPUs (probably supplied by NVIDIA) and plans to build 5 more supercomputers to process the facial recognition data collected by millions of cameras.


3. NVIDIA's defense strategy

Now that I've frightened everyone who holds NVIDIA stock and given hope to those who spend a lot of money on NVIDIA GPUs, let's take a realistic look at how NVIDIA can maintain its leadership in a much more competitive market. We need to look at the training and inference markets separately.

History lesson from Nervana

First, let's look at Intel's experience with Nervana. Before being acquired by Intel, Nervana claimed that its performance would be at least 10 times that of a GPU. Then a funny thing happened on the way to victory: NVIDIA surprised everyone with TensorCores, which were not 2 times but 5 times faster than Pascal. Next, NVIDIA doubled down with NVSwitch to build the amazing 16-GPU DGX-2 server (priced at $400,000, which is quite expensive), beating most (and perhaps all) competitors. At the same time, NVIDIA roughly doubled the performance of its cuDNN libraries and drivers. It also built the GPU Cloud, which makes using GPUs as simple as clicking and downloading an optimized software stack container for approximately 30 deep learning and scientific workloads. So, as I shared in the previous part, Intel's promised 10 times performance advantage evaporated, and Nervana, now promising a new chip at the end of 2019, went back to the drawing board. Basically, NVIDIA proved that 10,000 engineers with solid resumes and a deep technology reserve can out-innovate 50 smart engineers in a virtual garage. No one should be surprised, right?

Give 10,000 engineers a big sandbox

Now, fast forward almost three years to 2019. Once again, competitors claim that their chips have a 10 times or even 100 times performance advantage, and all of them are still under development. NVIDIA still has an army of 10,000 engineers and maintains technical partnerships with the world's top researchers and end users. Now they are all contributing to NVIDIA's next-generation 7 nm chip. In my opinion, this will essentially transform the company's products from "GPU chips with AI" to "AI chips with GPUs."

Figure 1: NVIDIA's DGX-2 supercomputer delivers 2 peta-ops of AI performance with 16 V100 GPUs connected over NVSwitch.

How much additional logic area can NVIDIA's engineers add to the company's next-generation products? Although the analysis below is simplistic, it is useful for framing an answer to this critical question.

Let's start with the first ASIC that appeared to have excellent performance, the Google TPU. I have seen analysis suggesting that each Google TPU chip has about 2 to 2.5 billion transistors. The Volta V100 has approximately 21 billion transistors in a 12 nm manufacturing process. It is the largest chip that TSMC can manufacture. As NVIDIA migrates from 12 nm to 7 nm, the chip can contain approximately 1.96 (1.4×1.4) times as many transistors. Therefore, in theory, if NVIDIA does not add any graphics logic (admittedly unlikely), it will have another 20 billion transistors to work with, roughly ten times the logic of an entire Google TPU. Suppose my logic is off by a factor of 2. In that case, NVIDIA engineers still have 5 times the logic available for new AI features. Now, all of this assumes that NVIDIA will go all out for performance rather than reducing cost or power. In the training market, however, this is exactly what users need: shorter training times. There are many ideas about what NVIDIA might deliver, including in-processor memory and more versions of TensorCores.
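The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic, under the same rough assumptions: about 21 billion transistors for the V100 on 12 nm, a 1.4× linear shrink to 7 nm, and 2 to 2.5 billion transistors per Google TPU. The function name `extra_transistor_budget` is hypothetical.

```python
def extra_transistor_budget(current: float, linear_shrink: float = 1.4) -> float:
    """Transistors gained at the same die size after a process shrink.

    Density scales with the square of the linear shrink (1.4 x 1.4 = 1.96).
    """
    density_gain = linear_shrink ** 2
    return current * density_gain - current


V100_TRANSISTORS = 21e9          # assumed: V100 on 12 nm
TPU_TRANSISTORS = (2.0e9, 2.5e9) # assumed range for one Google TPU

extra = extra_transistor_budget(V100_TRANSISTORS)
tpu_equivalents = [extra / t for t in TPU_TRANSISTORS]
halved = [r / 2 for r in tpu_equivalents]  # if the estimate is off by 2x

print(f"extra transistors: {extra / 1e9:.1f} billion")
print("TPU equivalents:", [round(r, 1) for r in tpu_equivalents])
print("if off by 2x:   ", [round(r, 1) for r in halved])
```

Under these assumptions the extra budget comes to roughly 20 billion transistors, or 8 to 10 TPUs' worth of logic, and 4 to 5 even if the estimate is off by a factor of 2.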

My point is that NVIDIA undoubtedly has the expertise and the available chip area to innovate, just as it did with TensorCores. I have talked to many interesting AI chip startups, and the ones I respect most tell me that they do not underestimate NVIDIA, nor do they think NVIDIA is trapped in a GPU mindset. NVIDIA's DLA and Xavier, an ASIC and an SoC respectively, demonstrate that NVIDIA can build a wide variety of accelerators, not just GPUs. As a result, many of these startups' CEOs have decided to stay out of NVIDIA's way in training and focus on inference first.

I don't think NVIDIA will be at a disadvantage in training for long. The problem may be that its chips are costly, but for training, customers will pay. In addition, for inference, NVIDIA's Xavier is an impressive chip.

The Cambrian explosion is good for programmability

Let us return to the idea of the Cambrian explosion. NVIDIA correctly points out that we are in the early stages of algorithm research and experimentation. An ASIC that does a good job on one kind of processing (such as convolutional neural networks for image processing) may (and almost certainly will) do a poor job on another (e.g., GANs, RNNs, or neural networks yet to be invented). This is where GPU programmability, combined with NVIDIA's ecosystem of researchers, comes in. If NVIDIA can solve the looming memory problems, the GPU can adapt quite quickly to new neural network approaches. By using NVLink to create a mesh of 8 GPUs with 256 GB of high-bandwidth memory (HBM), NVIDIA has significantly reduced memory capacity problems, albeit at a high cost. We will have to wait for its next-generation GPUs to see whether and how it solves the latency and bandwidth problems, which call for memory with approximately 10 times the performance of HBM.

The inference wars

As I wrote in Part 1 of this series, there is no dominant giant in inference; the edge and data center inference markets are diverse and poised for rapid growth. But I have to wonder whether, from a margin perspective, the high-volume inference market will be a particularly attractive one. After all, in what will become a commodity market, margins may be quite thin as many companies scramble for attention and sales. Some inference is simple and some is very hard. The latter market will sustain high margins, because only complex SoCs equipped with CPUs and parallel processing engines such as Nervana, GPUs, DSPs and ASICs can deliver the performance required for autonomous driving. Intel's Naveen Rao recently posted on Twitter that the Nervana inference processor will in fact be a 10 nm SoC with Ice Lake CPU cores. NVIDIA has already led the way with the Xavier SoC for autonomous driving, and Xilinx will take a similar approach with its Versal chip later this year. Any startup heading down this path needs two things: a) very good performance per watt, and b) an innovation roadmap that keeps it ahead of the commodity players.

Conclusion

In short, I want to reiterate the following points:

1. The future of AI will be enabled by dedicated chips, and the market for dedicated chips will become huge.

2. The world's largest chip companies intend to win the coming artificial intelligence chip wars. Although Intel is playing catch-up, don't underestimate its capabilities.

3. There are many well-funded startups, and some of them will succeed. If you want to invest in a VC-backed company, make sure it does not underestimate NVIDIA's strength.

4. Over the next 5 years, China will largely shed its reliance on American artificial intelligence technology.

5. NVIDIA has more than 10,000 engineers, and its next generation of high-end GPUs for artificial intelligence may surprise us all.

6. The inference market will grow rapidly and will have room for many application-specific devices. FPGAs may play an important role here, especially Xilinx's next-generation FPGAs.

Obviously, there is a lot to cover on this topic, and I have only scratched the surface! Thank you for taking the time to read this series of articles; I hope it was enlightening and informative.
