EDGE Insights

Supercharging AI progress: The possibilities and pitfalls of massive models

Can machines imitate human intelligence? Questions like this continue to drive the pursuit of artificial general intelligence (AGI)—intelligent agents with the full spectrum of human cognitive capabilities. The past few decades have been characterized by steady strides in the fields of AI and machine learning (ML), supported by the rapid growth of big data, the deep learning revolution, the scaling of computing power, and the emergence of transformer-based architecture.
The last two years alone have seen the meteoric rise of large AI models or artificial neural networks with the potential to “learn” like humans. From splashy media demos to the odd claim of sentience, these models have attracted a lot of attention for an emerging set of capabilities including mimicking human conversation, summarizing text, generating photorealistic imagery, completing lines of code, playing video games, and even studying medical images to enable healthcare resource planning. On the flip side, the emergence of these large models has fostered a heated debate on the financial and environmental costs of training and running such computation-intensive models as well as the risks of misuse. This insight delves into the potential of these models while exploring the challenges they present. 

What are large AI models?

Simply put, an AI model is a software program or algorithm trained on a set of data to perform specific tasks like pattern recognition, text generation, visual question answering, and more. Large AI models—the focus of this insight—are programs or algorithms that have grown in complexity to reach billions of parameters, enabling them to approach tasks with greater skill and accuracy than their smaller counterparts and predecessors. These models also tend to be tens of gigabytes (GB) in size and are trained on vast amounts of data, sometimes at the petabyte (PB) scale. Currently, large AI models broadly exist in the textual (language) or visual domains, or in the overlap between these modalities (multimodal).

Types of large AI models

What defines a “large” AI model is its number of parameters. Parameters are the values or “settings” in a machine learning algorithm (i.e., the weights and coefficients that the model estimates from training data and uses to produce outputs). For example, language models train by blanking out words, predicting them, and adjusting their parameters based on how those predictions compare with reality. As such, parameters represent the part of the model that has learned from historical training data.
Generally, the more parameters a model has, the more information it can digest from its training data, the more accurate its predictions, and the more sophisticated the model. Large AI models are often described as “neural networks” because they try to mimic the human brain through layers of deep learning algorithms. For example, OpenAI’s GPT-3 (175 billion parameters) was the largest of the current generation of large language models when it first emerged in 2020. The model is also referred to as a neural network because it leverages deep learning algorithms trained on internet data to produce human-like text.
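To make the “blank out a word, compare, adjust” training loop concrete, below is a minimal, hypothetical sketch of that objective in NumPy. The vocabulary, model size, and update rule are toy illustrations, not the actual training setup of GPT-3 or any other model mentioned here.

```python
# Illustrative only: a toy "fill in the blank" objective, not any vendor's
# actual training code. Vocabulary, data, and model sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8                     # vocabulary size, embedding width

# Parameters: these weights are the "settings" the model learns.
W_in = rng.normal(0, 0.1, (V, D))        # context-word embeddings
W_out = rng.normal(0, 0.1, (D, V))       # output projection over the vocabulary

def predict(context_ids):
    """Average the context embeddings and score every word in the vocabulary."""
    h = W_in[context_ids].mean(axis=0)
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

# "the cat sat on the ___"  ->  the target word "mat" (index 4) is blanked out.
context, target = [0, 1, 2, 3, 0], 4
lr = 0.5
for step in range(200):
    h, probs = predict(context)
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0               # gradient of the cross-entropy loss
    W_out -= lr * np.outer(h, grad_logits)   # adjust parameters toward the truth

print(vocab[int(np.argmax(predict(context)[1]))])  # -> "mat" after training
```

Real large language models apply the same basic idea at a vastly larger scale, with billions of parameters updated across trillions of such prediction-and-correction steps.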
For recent updates on OpenAI’s models, please also check out OpenAI: The ChatGPT creator leading large AI commercialization.

Large AI model parameters during 2020-2022

Note: This visual excludes Meta’s Facebook DLRM since this model is an outlier (at 12 trillion parameters) and has no comparable equivalents.
Source: SPEEDA Edge based on multiple sources

How did these models evolve?

The largest AI models in existence were all released during 2020–2022. It is worth noting, however, that this scaling in size, complexity, and performance did not happen overnight. The ideas, deep learning algorithms, and artificial neural networks at the core of large AI models have been in development since the 1960s. However, interest in AI and ML only truly took off in the late 80s and early 90s, when neural networks were found capable of problem-solving by accepting raw input data and building up hierarchical representations to perform predictive tasks.
However, at the time, complex problem-solving was limited by available computation power. For a while, computers could only handle simple, small-scale problems like MNIST (a handwritten digit classification problem). It wasn’t until the tail-end of the 2000s, after two more decades of computational performance improvements driven by Moore’s Law, that computers finally became powerful enough to train large neural networks. 
Overall, four key developments have guided the evolution of the current class of large AI models.

1. The rise of big data 

Big data and AI work together. ML, the subset of AI that enables computers to learn, solves tasks by learning from data and making predictions. The explosion of data over the past decade has been instrumental to the emergence of large AI models. In 2012, IBM estimated that over 90% of the world’s data had been created in the preceding two years alone. At the time, the global data supply stood at an estimated 12.8 zettabytes (ZB), or 12.8 trillion gigabytes (GB). By 2020, the estimated total amount of data created, captured, copied, and consumed in the world stood at 59 ZB—the equivalent of 59 trillion GB. By 2025, this is predicted to reach 175 ZB. On a daily basis, we generate an estimated 500 million tweets, 294 billion emails, 4 million GB of Facebook data, 65 billion WhatsApp messages, and 720,000 hours of new content on YouTube.
The intersection of big data and AI is central to mining value from information. OpenAI’s GPT-3 was trained with over 45 terabytes (TB) of data from the internet including Wikipedia and books, while Google’s Switch Transformer was trained on the Colossal Clean Crawled Corpus, a 750 GB dataset composed of text snippets from Wikipedia, Reddit, etc.

2. AI/ML advances and the deep learning revolution

Over the past decade, there has been a clear spike in research output in the field of machine learning and its applications. arXiv, the open-access e-print repository for scholarly articles, logged over 32 times as many ML-related papers in 2018 as in 2009 (roughly doubling every two years). Over 100 ML-related papers are posted to arXiv daily, with no indication of slowing down. These leaps in research are also evident in the performance improvements of ML-based computer vision models. Consider the ImageNet challenge—a large-scale visual recognition challenge in which contestants are provided a training set of 1 million color images across 1,000 categories and must train models in image classification and object recognition. Before deep learning approaches entered the contest, winning entrants used hand-engineered computer vision features, and the top-5 error rate remained above 25%. Between 2011 and 2017, the adoption of deep learning approaches and neural networks saw the winning ImageNet top-5 error rate plummet to 2.3% (2017).
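For readers unfamiliar with the metric, here is a brief, hypothetical sketch of how top-1 and top-5 error rates are scored on an ImageNet-style classification task (the same top-1 accuracy measure reappears in the computer vision comparison later in this insight). The labels and model scores below are random placeholders.

```python
# A minimal sketch of top-1 and top-5 error rates on an ImageNet-style task;
# the "model" outputs below are invented random scores.
import numpy as np

rng = np.random.default_rng(1)
num_images, num_classes = 1000, 1000
true_labels = rng.integers(0, num_classes, size=num_images)
scores = rng.normal(size=(num_images, num_classes))   # one score per class per image

top5 = np.argsort(-scores, axis=1)[:, :5]             # five highest-scoring classes
top1_err = np.mean(top5[:, 0] != true_labels)          # the single best guess is wrong
top5_err = np.mean([t not in row for t, row in zip(true_labels, top5)])

print(f"top-1 error: {top1_err:.1%}, top-5 error: {top5_err:.1%}")
```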

ImageNet contest winning entry top-5 error rate (%)

3. Growth in compute

During the pre-deep learning era (1952–2010), the amount of compute used in AI training runs grew in line with Moore’s Law, doubling around every 18–20 months. After the advent of deep learning in the early 2010s, the scaling of training compute accelerated, doubling approximately every six months for regular-scale models. In 2015/16, the first large-scale models emerged, starting with models like DeepMind’s AlphaGo (the first computer program to defeat a professional human Go player). These large-scale models display a doubling time of around 10 months. Improvements in compute represent a key component of AI progress; for as long as this scaling trend continues, AI systems have the potential to evolve far beyond today’s capabilities.
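As a rough, back-of-the-envelope illustration of what these doubling times imply (using the periods cited above as assumptions), the snippet below computes the implied growth in training compute over a five-year window.

```python
# Back-of-the-envelope growth implied by the doubling times cited above
# (assumed periods; actual estimates vary by study).
for label, doubling_months in [("Moore's Law era", 20),
                               ("deep learning era", 6),
                               ("large-scale models", 10)]:
    growth_over_5_years = 2 ** (60 / doubling_months)
    print(f"{label}: ~{growth_over_5_years:,.0f}x more training compute in 5 years")
```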

4. Development of transformer neural network architecture

Neural networks, and more specifically artificial neural networks (ANNs), mimic the human brain through a set of deep learning algorithms. These networks are a key approach for language-understanding tasks, including language modeling, machine translation, and question answering. In 2017, Google invented and open-sourced a new neural network architecture called the “transformer,” built around a self-attention mechanism that mimics cognitive attention. In simple terms, self-attention enables large AI models to focus on certain parts of the input and reason more effectively about which data is most relevant given the context. The Google team claimed that the transformer architecture worked better than previous leading approaches such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Currently, most large AI models are transformer-based (a notable exception is Meta’s SEER, which is CNN-based).
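The sketch below is a minimal, illustrative implementation of the scaled dot-product self-attention at the core of the transformer. The matrices are tiny and randomly initialized, so it demonstrates the mechanism rather than any production model.

```python
# A toy scaled dot-product self-attention layer; weights are random stand-ins
# for what a trained transformer would learn.
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model = 4, 8          # four input tokens, eight-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))

# Learned query/key/value projections would normally come from training.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # how relevant is token j to token i?
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token
output = weights @ V                             # each token becomes a weighted mix

print(weights.round(2))   # each row sums to 1: every token "attends" across the sequence
```

Because every token can attend to every other token in a single step, this mechanism parallelizes well on modern accelerators, which is one reason transformers scaled so much further than RNN-based predecessors.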

Who are the main players?

Due to the cost of training and running large AI models, the space has high barriers to entry and is currently led by big tech companies, other established global tech players, and AI labs. Open source AI collectives also stand to impact the space but at present, there is only one that fits the bill—BigScience.
Currently, US-based big tech players are the frontrunners in the space, followed by tech incumbents in China, South Korea, and Russia. AI labs also play a critical role, either on a standalone basis or as big tech partners in AI research. For instance, Microsoft has invested USD 1 billion in OpenAI to support the development of artificial general intelligence with broadly distributed economic benefits. Microsoft’s coding product Copilot is powered by OpenAI’s Codex. Similarly, both Huawei and Baidu have developed large AI models in collaboration with China-based AI lab Peng Cheng Laboratory. 
Key players also tend to collaborate (both with each other and beyond) on the hardware and software systems that support these large AI models. For instance, LG’s model Exaone was built through a strategic partnership with Google, where the latter provided LG with TPU v4 AI chips (for computation) while Google Brain created a compatible software framework for Exaone. Meanwhile, Microsoft and Nvidia collaborated to train MT-NLG, leveraging state-of-the-art supercomputing clusters like the Nvidia Selene and Microsoft Azure NDv4.

The companies behind large AI models


What can these models do?

Let’s take a look at model capabilities and check out some of the most memorable “snapshots” of how these models perform in select applications. 
Take large language models (LLMs) for example. Question answering is a typical task for these models but LLMs can be trained in different ways, lending themselves to varied outcomes and applications. Google’s LaMDA, for instance, was trained on dialogue and generates the kind of free-flowing responses that could redefine the field of conversational AI. 
The first snapshot below is from a 2021 demo in which the Google team asks LaMDA a series of questions that the AI answers from the vantage point of Pluto, the dwarf planet. The chat starts with LaMDA sharing concrete facts and events like the New Horizons probe that visited Pluto in 2015. As the conversation develops, “Pluto” goes on to remark that it is “not just a random ice ball” and is “actually a beautiful planet” that “doesn’t get the recognition” it deserves because “sometimes people refer to (it) as just a dwarf planet”. 

Snapshot: Conversation with Google’s LaMDA as Pluto 

While humor and sarcasm are complex traits that are difficult for AI to reproduce, some large AI models can be fed article topics and even asked to generate funny headlines. For instance, Israel-based AI21 Labs’ Jurassic-1 was issued a command to come up with a catchy screamer headline for a column about Elizabeth II’s small feet. The AI model came back with “Queen Size.”
Fine-tuned language models are just as interesting in terms of the possibilities they present. The following snapshot from OpenAI’s 2021 demo of Codex indicates how text-generating AI models can be harnessed to reduce the coding burden on software developers. In this instance, the programmer types out instructions in English and Codex converts them into code. Notably, Codex can function in over a dozen programming languages including JavaScript, Perl, PHP, and Ruby.

Snapshot: Creating a space game with OpenAI Codex

Source: OpenAI (Codex)
The following snapshot indicates how computer vision models like Meta’s SEER perform on image recognition tasks relative to other state-of-the-art models trained on the ImageNet database. To test the model’s robustness, Meta also evaluated SEER’s performance on images that had been distorted through blurring, insertions, cropping, and other editing techniques.

Snapshot: Image recognition with Meta’s SEER

Multimodal models focused on text-to-image generation are built on the intersection of textual and visual capabilities and hold exciting possibilities for design, image prototyping, and other fields. The snapshot below is the visual created by Google’s Imagen in response to the following text prompt: “A majestic oil painting of a raccoon Queen wearing a red French royal gown. The painting is hanging on an ornate wall decorated with wallpaper.”

Snapshot: Generating images from text with Google’s Imagen

Multimodal models like DALL-E, DALL-E 2, Parti, and Imagen operate with a deep level of language understanding and photorealistic image-generation capabilities. The variety of visuals they can produce is limited only by the human imagination.
For a detailed view of the tasks completed by large AI models, please refer to Appendix: Functionality by model. 

What are their potential uses and applications? 

Most developers of large AI models have either leveraged them to strengthen in-house products and services or actively seek research collaborations to explore the full potential of their models while mitigating the risks of misuse. It has been typical for big tech to only provide limited access to their models and for clearly defined uses (e.g., Meta has granted limited access to OPT for research use cases). Big tech players have also expressed reservations about open sourcing their models (see the limitations section for more detail).
Only a few players are ahead of the pack in terms of commercialization plans. As one of the earliest LLMs on the scene, OpenAI’s GPT-3 reportedly has over 300 applications under its belt, while its fine-tuned, coding-focused descendant Codex has over 70 applications. However, the latter is only offered through a closed beta program. Meanwhile, only a handful of models are being released more universally, whether through open betas (like AI21 Labs’ Jurassic-1) or open sourcing (like BigScience’s BLOOM or Meta’s Facebook DLRM).

What is driving demand for large AI models?

1. Growing importance of digital customer experience management

Consumers have increasingly moved toward online channels, with this trend only accelerated by the pandemic. A 2020 McKinsey study indicated that organizations are now three times likelier (than before the pandemic) to report that at least 80% of their customer interactions were digital. Customer-facing elements are among the areas impacted, with around 62% of organizations experiencing increasing customer demand for online purchasing and services. 
As customers increasingly engage with enterprises digitally, enterprises are doubling down on investments like conversational AI. Some predictions suggest that up to 40% of all communications on mobile platforms may eventually be managed by conversational AI (chatbots and smart assistants). These developments present interesting opportunities for LLMs, particularly those focused on free-flowing conversational applications, such as Google’s LaMDA. Unlike many other LLMs, LaMDA was trained on dialogue, and its conversational skills have been years in the making.

2. Developer shortages amidst the Great Resignation

Data from Salesforce’s 2022 MuleSoft survey indicates that 93% of organizations find it more difficult to retain skilled software developers, while 86% have found it more difficult to recruit them over the last two years. The churn caused by the “Great Resignation” has only compounded the problem. The top three causes contributing to developer burnout were increasing workload (39%), the pressures of digital transformation (37%), and the need to adapt to new technologies and approaches (35%). Around 75% of organizations reported that the cognitive load required to learn their software architecture was so high that it was a source of discontent and low productivity for developers. Notably, 91% of organizations emphasized that they needed solutions that automated key processes for developers, so that they could “do more with less.” Fine-tuned language models like OpenAI’s Codex can be leveraged to build AI pair-programming tools that save developers time and effort, alleviating this burden.

3. Demand for automation and autonomy 

Growing demand for automation and autonomy across sectors will continue to fuel demand for the underlying computer vision and multimodal models that enable machines to mimic human capabilities and function alongside humans. A 2020 MIT study revealed that four manufacturing sectors in the US currently account for 70% of robots—automakers (38% of robots in use), electronics (15%), the plastics and chemicals industry (10%), and metals manufacturers (7%). As the global manufacturing sector steadily marches toward industry 4.0, these applications can be expected to grow. Propelled by demand for professional service robots in applications like healthcare and consumer services, progress in fields like humanoid tech may also drive the search for the elusive sentient AI. 
Meanwhile, automotive applications ranging from advanced driver assistance systems to autonomous vehicles will continue to drive demand for multimodal AI models capable of harnessing and processing a variety of visual inputs. For instance, Google’s Perceiver can currently take in three kinds of inputs—images, videos, and LiDAR point clouds from auto-driving scenes.

4. Demand for predictive insights and analytics in healthcare and beyond

The deep learning advances made by large AI models when it comes to natural language tasks, computer vision, and multimodal learning have implications for many fields that require predictive insights and analytics. In particular, there is evidence to highlight the value of these AI models for medical imaging diagnostic tasks for diabetic retinopathy, breast cancer pathology, lung cancer CT scan interpretation, and dermatology. Meta, for instance, has leveraged the same self-supervised learning methods that underlie its SEER vision model to develop three machine learning models that help predict Covid-19 resource needs. These models study a series of patient x-rays and help doctors predict how a patient’s condition may develop, helping hospitals with resource planning. Other fields that have reportedly benefited from such deep-learning approaches include quantum chemistry, earthquake prediction, flood forecasting, genomics, protein folding, high energy physics, and agriculture. 

How do these models compare?

A defining characteristic of large AI models is that they can typically perform their tasks with high levels of accuracy across various benchmarks, which differ based on the type of model. Some of these models are also evaluated on their performance and accuracy in zero-shot, one-shot, and few-shot settings (i.e., how well they handle a task when given no worked examples, a single example, or a handful of examples in the prompt, with no further training).
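As an illustration, the snippet below builds the kind of zero-, one-, and few-shot prompts used in such evaluations. The translation task and examples are invented stand-ins; real benchmarks use fixed datasets and templates.

```python
# Illustrative zero-, one-, and few-shot prompts. The task and examples are
# made up for demonstration purposes.
task = "Translate English to French:"
examples = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
query = "peppermint"

zero_shot = f"{task}\n{query} =>"
one_shot = f"{task}\n{examples[0][0]} => {examples[0][1]}\n{query} =>"
few_shot = task + "".join(f"\n{en} => {fr}" for en, fr in examples) + f"\n{query} =>"

print(few_shot)
# The model sees zero, one, or a few worked examples in the prompt and must
# complete the pattern without any parameter updates (no fine-tuning).
```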
The performance comparison in this section is limited to key LLMs, computer vision models, and multimodal models. It excludes fine-tuned language models and DLRMs, of which OpenAI’s Codex and Facebook’s DLRM are the largest and best-in-class, respectively. 

Large language models (LLMs)

Below, we have compared the most notable large language models on the LAMBADA benchmark, an open-ended cloze task that requires models to predict a missing target word in the last sentence of around 10,000 passages from BookCorpus. LAMBADA is only one of ~29 potential comparison metrics for LLMs, but it represents a key benchmark for research on models of natural language understanding. When the LAMBADA benchmark was first introduced in 2016, none of the state-of-the-art language models reached over 1% accuracy.
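To illustrate what a LAMBADA-style evaluation looks like, here is a toy sketch in which a stand-in “model” must produce the final word of each passage. Both the passages and the model are invented for illustration; a real evaluation would query an actual LLM over the benchmark’s ~10,000 passages.

```python
# A simplified LAMBADA-style check: predict the final word of a passage.
passages = [
    ("He poured the coffee, added milk, and took a slow first", "sip"),
    ("She tightened the last bolt and lowered the hood of the", "car"),
]

def toy_model(prompt: str) -> str:
    # Stand-in for an LLM call; a real evaluation would query the model here.
    return "sip" if "coffee" in prompt else "engine"

correct = sum(toy_model(prompt) == target for prompt, target in passages)
print(f"accuracy: {correct / len(passages):.0%}")   # 50% on this toy set
```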

LAMBADA accuracy across LLMs

Computer vision models

Computer vision models are benchmarked based on their performance in image classification problems like the ImageNet validation dataset. Performance is tracked based on top-1 classification accuracy or conventional accuracy, which measures how often the predicted data label matches the single target label. 
Google’s vision transformer ViT-G/14 is ahead, with 90.45% top-1 accuracy on ImageNet. Comparatively, Meta’s SEER has achieved 84.2% top-1 accuracy on the same dataset.

Multimodal models

Multimodal models focused on text-to-image generation are often measured on the COCO benchmark—a large-scale object detection, segmentation, and captioning dataset. Fréchet inception distance (FID) scores achieved on the COCO benchmark represent the quality of images created by generative models (a lower FID indicates better-quality images).
For a while, OpenAI’s DALL-E was the frontrunner among large multimodal models but it was recently unseated by Google’s text-to-image models. Google’s Parti and Imagen currently outperform OpenAI’s DALL-E and DALL-E 2 on the COCO benchmark, achieving better (i.e., lower) zero-shot FID scores.
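For reference, the sketch below computes an FID between two sets of image features. A real evaluation would use Inception-v3 embeddings of thousands of real and generated images; the feature vectors here are random placeholders.

```python
# A minimal sketch of the Frechet inception distance (FID) between two feature
# sets; the feature vectors are random stand-ins for Inception-v3 embeddings.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
real_feats = rng.normal(0.0, 1.0, size=(2048, 64))    # features of "real" images
gen_feats = rng.normal(0.1, 1.1, size=(2048, 64))     # features of generated images

mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
cov_r, cov_g = np.cov(real_feats, rowvar=False), np.cov(gen_feats, rowvar=False)

cov_mean = sqrtm(cov_r @ cov_g)
if np.iscomplexobj(cov_mean):          # numerical noise can add tiny imaginary parts
    cov_mean = cov_mean.real

fid = np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean)
print(f"FID: {fid:.2f}")               # lower means the two distributions are closer
```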

FID scores on COCO benchmark


Limitations 

1. Cost of training and running large AI models

All large AI models have steep development costs. According to a 2020 study by AI21 Labs, a text-generating model with 1.5 billion parameters can require as much as USD 1.6 million to train. By contrast, models in the current generation have hundreds of billions of parameters and are trained on specialized hardware like graphics processing units (GPUs), driving up costs even further. For instance, BigScience’s BLOOM was trained on 384 Nvidia Tesla A100 GPUs that cost ~USD 32,000 each.
There is also the hardware cost of inference (i.e., running the trained model). The compressed BLOOM model is 227 GB, and running it requires specialized hardware with hundreds of GB of VRAM. In comparison, GPT-3 requires a computing cluster equivalent to an Nvidia DGX-2, which is priced at around USD 400,000. Alternatively, if run on the cloud, inference for a model like GPT-3 carries a minimum annual cost of around USD 87,000 on a single AWS instance.
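For rough intuition on where an annual figure of that order comes from, the arithmetic below assumes a hypothetical on-demand rate of about USD 10 per hour for a single large GPU instance running around the clock; actual AWS pricing varies by instance type and region.

```python
# Rough, hypothetical arithmetic behind an annual cloud-inference bill; the
# hourly rate is an assumption, not a quoted AWS price.
hourly_rate_usd = 10.0            # assumed on-demand price for one large GPU instance
hours_per_year = 24 * 365
print(f"~USD {hourly_rate_usd * hours_per_year:,.0f} per year")   # ~USD 87,600
```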
The cost burden goes beyond the financial. The GPUs that AI models are trained on draw more power than traditional CPUs. Based on a life cycle assessment for training several common large AI models, researchers at the University of Massachusetts Amherst discovered that the process can emit over 626,000 pounds of carbon dioxide equivalent. In effect, training a single large AI model generates nearly five times the lifetime carbon emissions of the average American car (including its manufacture). Without a switch to 100% renewable energy sources, AI progress may come at the expense of carbon neutrality.

2. Time and memory-intensive training process

Training large AI models is challenging for many reasons beyond cost. For starters, the parameters of these models do not fit in the memory of even the largest GPUs. Moreover, the large number of compute operations needed to refine these models can also result in long training times, unless the algorithms, software, and hardware are optimized together. Training Microsoft’s 530-billion-parameter MT-NLG model in a reasonable timeframe required the convergence of Nvidia A100 Tensor Core GPUs and HDR InfiniBand networking, state-of-the-art supercomputing clusters such as the Nvidia Selene, and Microsoft Azure NDv4.

3. Ethical and social risks

The ethical and social risks of AI models are areas of ongoing investigation and concern. AI ethicist and former Google researcher Timnit Gebru has detailed the challenges associated with large language models in a notable paper on the “dangers of stochastic parrots.” DeepMind has also outlined six specific risk areas of LLMs in particular, including the risks of discrimination as well as representational and material harm from perpetuating stereotypes and social biases. Essentially, since these AI models feed on data from the internet, they mimic the stereotypes, hate speech, misinformation, and toxic narratives found in this data. For instance, researchers have documented persistent biases in large language models like GPT-3, where the word “Muslim” has been associated with the word “terrorist” in 23% of test cases, while “Jewish” has been mapped to “money” in 5% of cases. Moreover, most online data comes from young users in developed countries, with a strong skew toward male users aged between 18 and 29 years. Given the scale of the training data, the lack of data from under-represented groups creates a feedback loop that further lessens the impact of data received from these populations.
These challenges are not limited to LLMs. OpenAI has also noted that image-generating multimodal AI models like DALL-E can similarly pick up on the biases and toxicities embedded in the training images from the web. When prompted with a term like “lawyers,” for instance, DALL-E generated a series of images of white-passing men, while a term like “flight attendant” led it to generate a series of images of women with East-Asian features. The risk of misuse has led some big tech players to hold off on releasing access to their models until frameworks for responsible externalization have been explored. For instance, Google has opted against open sourcing code or demos for its image-generation models Parti and Imagen.

What’s next?

Open-source AI may eventually challenge the stranglehold of big tech: Thus far, large AI models have been the sole purview of big tech incumbents and AI laboratories with the resources to train and run them. BigScience’s 2022 release of BLOOM marks a departure from this trend. The model was funded by grants from French research agencies CNRS and GENCI and is the product of an open-source AI collective comprising over 1,000 researchers across ~70 countries and ~250 institutions. It is reportedly the largest collaboration of AI researchers on a single research project and the first multilingual LLM to be trained in complete transparency. BLOOM is completely open source, and Hugging Face has made the full model (as well as some smaller pre-trained versions) available to the public via its Transformers API. It remains to be seen how these open-source models fare against the big tech models that are largely internal or restricted to closed beta testers.
Bigger may not always be better: Given the steep development costs and massive carbon footprints involved, the trend of ever-larger AI models may not prove sustainable in the long run. Moreover, larger models do not necessarily deliver better performance, especially if they are undertrained. Instead, compute-optimal models may be an alternative way forward. DeepMind’s Chinchilla, a 70-billion-parameter model, is a case in point. Although four times smaller than DeepMind’s Gopher, it was trained on four times more data. Researchers found that Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG across several language benchmarks. Chinchilla’s zero-shot LAMBADA score, for instance, outpaces its peers and is only slightly behind that of PaLM (see performance comparison section). The anticipated release of GPT-4 appears to confirm the view that bigger may not always be better; OpenAI CEO Sam Altman has indicated that GPT-4 will not be any bigger than GPT-3 but will use more compute resources, as the company focuses on getting the most out of smaller models, including a stronger focus on coding applications.
Commercial applications may take time: Some large AI models have come out ahead in terms of commercialization (e.g., OpenAI’s GPT-3 and Codex). However, the same outcomes may take longer for other models to achieve, particularly where further R&D is required to explore potential use cases and mitigate the risks of misuse. Google has, for instance, revealed that it has no plans to release Parti models, code, or data for public use without careful model bias measurement and mitigation strategies. 
More research needed: Much work is yet to be done to develop machines that can truly mimic human capabilities. While the current generation of large AI models shows great promise, further research and exploration are necessary to iron out the kinks. The ethical and social concerns associated with large models are valid and demand attention. As a 2022 article by Scientific American put it: “What we really need right now is less posturing and more basic research.”

Appendix: Functionality by model
