Dec 24, 2020

A couple quick things I can think of:

- Voice transcription

- Tesla Autopilot

- Facial recognition (photo sorting on iPhones, better photos)

- Better graphical performance on Nvidia cards (https://developer.nvidia.com/dlss), also better compression for streaming.

- Much better translation

- Colorizing and repairing old photos

- Visual recognition allowing better search of images

I’m sure there are some I left out. I think we’ll see a lot more interesting applications (particularly around tooling) in the next few years.

https://medium.com/@karpathy/software-2-0-a64152b37c35

Outside of the consumer space, there are also things that hint at more generalizable intelligence.

Check out GPT-3’s performance on arithmetic tasks in the original paper (https://arxiv.org/abs/2005.14165)

Pages: 21-23, 63

This shows some generality: the best way to accurately predict an arithmetic answer is to deduce how the mathematical rules work. The paper shows some evidence of that, and that's just from a relatively dumb predict-what-comes-next model.
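
If you want to poke at this yourself, here's a rough sketch of how you might probe the arithmetic behaviour, assuming the openai Python package and API access; the engine name, prompt wording, and settings are illustrative, not what the paper used:

    import openai  # assumes the openai package is installed and OPENAI_API_KEY is set

    # Few-shot prompt: a couple of worked examples, then a fresh problem.
    prompt = (
        "Q: What is 48 plus 76?\n"
        "A: 124\n"
        "Q: What is 97 plus 15?\n"
        "A: 112\n"
        "Q: What is 63 plus 59?\n"
        "A:"
    )

    completion = openai.Completion.create(
        engine="davinci",   # illustrative engine name
        prompt=prompt,
        max_tokens=5,
        temperature=0,      # greedy decoding, so the answer is deterministic
    )
    print(completion.choices[0].text.strip())  # hopefully "122"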

It’s hard to predict timelines for this kind of thing, and people are notoriously bad at it. In 2010, nobody would have predicted the results we’re seeing today. What would you expect to see in the years leading up to AGI? Does what we’re seeing look like failure?

https://deepmind.com/blog/article/muzero-mastering-go-chess-...

Oct 07, 2020

Can you please explain to me (I don't understand it) why you think GPT-3 is closed? Yes, they won't share the trained model, but they're sharing the research here[0][1] so you can reproduce it, aren't they? As I understand it, that's fair - training the model is a separate thing from doing (and sharing) the research, it's very costly, and it would not happen if they were forced to open that too. I also don't understand why they should be.

[0] https://arxiv.org/abs/2005.14165

[1] https://github.com/openai/gpt-3

Sep 11, 2020

I think mixing in the human bit confuses the issue: you could have a goal-oriented AGI that isn't human-like and still causes problems (the paperclip maximizer).

Check out GPT-3’s performance on arithmetic tasks in the original paper (https://arxiv.org/abs/2005.14165)

Pages: 21-23, 63

This shows some generality: the best way to accurately predict an arithmetic answer is to deduce how the mathematical rules work. That paper shows some evidence of that.

> evolution has had almost a billion years to arrive at complex brains

There are brains everywhere, and evolution is extremely slow. Maybe the large computational cost of training models amounts to speeding that computation up?

> there is no obvious way to go from approximately human-level AGI to the kinds of sci-fi super-super-human AGIs that some AI catastrophists imagine.

It's worth reading more about the topic. It's less that we'll have some human-comparable AI and then be stuck with it, and more that things will continue to scale. Stopping at human level might be the harder task (as might getting something human-like at all).

> This is all not to mention that we have no way right now of tackling the problem of teaching the vast amounts of human common sense knowledge that is likely baked into our genes to an AI, and it's hard to tell how much that will impact true AGI.

This is a good point, and it's basically the 'goal alignment' problem or 'friendly AI' problem. It's the main reason for the risk, since you're more likely to get a powerful AGI without these 'common sense' human intuitions. I think your mistake is treating goal alignment as a prerequisite for AGI - the risk comes precisely from the fact that it isn't. Also, humans aren't entirely goal-aligned either, but that's a different issue.

I understand the skepticism; I was skeptical too. But if you read more about it (not pop-sci, but the books by the people actually working on this stuff), it's more solid than you probably think, and your positions on it won't hold up.

Sep 09, 2020

Not exactly: GPT-3 uses a variant of BPE [1], so one token can correspond to a character, an entire word or more, or anything in between. The paper [2] says a token corresponds to 0.7 words on average.

[1] https://en.wikipedia.org/wiki/Byte_pair_encoding

[2] https://arxiv.org/abs/2005.14165, page 24
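
To make the BPE idea a bit more concrete, here's a toy sketch of the merge procedure. It's purely illustrative and not GPT-3's actual tokenizer (the real thing is a byte-level BPE with a fixed, pre-trained vocabulary):

    from collections import Counter

    # Toy byte-pair encoding: start from characters, repeatedly merge the
    # most frequent adjacent pair of symbols.
    def learn_bpe(words, num_merges):
        vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in vocab.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            merged_vocab = Counter()
            for word, freq in vocab.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged_vocab[tuple(out)] += freq
            vocab = merged_vocab
        return merges

    print(learn_bpe(["lower", "lowest", "newer", "wider"], 5))
    # frequent pairs like ('e', 'r') get merged into single tokens, so a token
    # can end up covering anything from one character to a whole word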

Sep 04, 2020

Bear with me, I am a ML amateur. What is this then? https://arxiv.org/abs/2005.14165

Aug 26, 2020

That's a fair criticism; the original paper would be better.

I linked to the blog because that's where I first read about this and it's more immediately accessible.

[Edit: Paper Link, https://arxiv.org/abs/2005.14165]

In fact, after reading these sections of the actual paper, it's hard to believe that you could have read it yourself and come away with the idea that it was memorization - pages 22-23 in particular. Part of the reason I tend to link to good blog posts instead of academic papers is that when you link to papers, nobody reads them. Often the people linking them haven't read them either (I've only read small parts).

Aug 26, 2020

This (GPT-3’s performance on arithmetic tasks) is covered in the original paper (https://arxiv.org/abs/2005.14165)

Pages: 21-23, 63

Aug 22, 2020

I'm getting impatient with criticisms of ML models that are already covered in the papers introducing the models. OP is basically trying to get it to do what the GPT-3 paper calls zero-shot inference. In the paper, it's pretty bad at zero-shot inference across the board. And given what it does and how it was trained, that's unsurprising. And the point they're trying to make (that it can fail spectacularly) is also covered in the paper.

It can do cool shit. It sucks at a lot of stuff. It's impressive and limited, but the hype train seems to only allow "it's nearly human level" or "it's awful." To everybody who is arguing about its capabilities without having read the paper yet, please read it. Then we can discuss stuff that hasn't already been covered more rigorously in the original paper. I don't know Davis, but I respect Marcus, and it seems like he's pushing back on the hype more than the actual model. Just not in a way that you couldn't glean from the paper itself (it almost always sucks on zero-shot), making it pretty disingenuous. Further, from the paper [0]:

> it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks.

Maybe that's the curse of doing a thing that has broad implications. You can't fit the implications in a 10 page paper, so you write a 75 page paper. The blogosphere reads the first 10 pages (if even that), and because there's so much more to it than that introduction, they go on to argue about the rest of the implications without reading it. I'm sure Marcus and Davis have read it, but this criticism wouldn't be on the front page if everyone else interested in this article had read the paper too.
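
For anyone who hasn't opened the paper, the zero-shot / few-shot distinction it draws is purely about what goes into the prompt (the weights are never updated either way); a rough illustration with made-up wording, not the paper's exact prompts:

    # Zero-shot: only a natural-language description of the task.
    zero_shot = (
        "Unscramble the letters into a word:\n"
        "pplea ="
    )

    # Few-shot: the same description plus a handful of solved examples
    # placed in the context window before the real question.
    few_shot = (
        "Unscramble the letters into a word:\n"
        "irkcb = brick\n"
        "nohpe = phone\n"
        "pplea ="
    )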

[0] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

Aug 22, 2020

The article is a critical view of GPT-3. Fair. It is well known that Gary Marcus is not a fan of GPT-style systems, and he does make some valid points. If you want a more balanced view, it actually helps to look at all their prompts [1].

That said, I think the claim that GPT-3 is moving towards AGI is mostly hype. The actual GPT-3 paper is titled "Language Models are Few-Shot Learners" [2], so it's surprising that no one has done a real analysis of that claim. Are they really few-shot learners? My experiments seem to suggest otherwise.

But for sure, GPT-3 is the best general purpose natural language system out there in the world. I don't think anyone can say otherwise.

[1] https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...

[2] https://arxiv.org/abs/2005.14165

Aug 16, 2020

GPT-3 can't do arithmetic very well at all. There is a big, fat, extraordinary claim in the GPT-3 paper that it can, but it rests only on perfect accuracy on two-digit addition and subtraction, ~90% accuracy on three-digit addition and subtraction, and around 20% accuracy on addition and subtraction with three to five digits and on multiplication of two measly digits. Note: no division at all, and no arithmetic with more than five digits. And there was very poor testing to ensure that the solved problems don't just happen to be in the model's training dataset to begin with, which is the simplest explanation of the reported results, given that the arithmetic problems GPT-3 solves correctly are exactly the ones most likely to be found in a corpus of natural language (i.e. two- and three-digit addition and subtraction).
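
If anyone wants to sanity-check figures like these, the evaluation boils down to something like the sketch below: generate random n-digit problems, query the model, and score exact matches. The ask_model argument is a placeholder for however you call GPT-3; everything else is my own approximation of the setup, not the paper's harness.

    import random

    def make_problem(n_digits, op="+"):
        # numbers drawn uniformly from the n-digit range
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        question = f"Q: What is {a} {'plus' if op == '+' else 'minus'} {b}?\nA:"
        answer = a + b if op == "+" else a - b
        return question, str(answer)

    def accuracy(ask_model, n_digits, op="+", trials=200):
        # exact-match scoring on the first token of the completion
        correct = 0
        for _ in range(trials):
            question, answer = make_problem(n_digits, op)
            reply = ask_model(question).strip()
            first = reply.split()[0].rstrip(".,") if reply else ""
            correct += first == answer
        return correct / trials

    # e.g. accuracy(my_gpt3_wrapper, n_digits=5) to check the five-digit addition claim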

tl;dr, GPT-3 can't do basic math for problems it has not been directly trained on.

____________

[1] https://arxiv.org/abs/2005.14165

See section 3.9.1 and Figure 3.10. There is an additional category of problems of combinations of addition, subtraction and multiplication between three single-digit numbers. Performance is poor.

Jul 27, 2020

While models such as XLNet incorporate recurrence, GPT-{2,3} are mostly just plain decoder-only transformer models. [1]

[1] https://arxiv.org/abs/2005.14165

[2] https://d4mucfpksywv.cloudfront.net/better-language-models/l...
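
To make "plain decoder-only transformer" concrete, here's a tiny numpy sketch of the core ingredient, masked (causal) self-attention. Dimensions and weights are made up and this is nothing like the real 175B-parameter implementation, just the attention pattern that makes the model a left-to-right next-token predictor:

    import numpy as np

    def causal_self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model); single head, no output projection, toy only
        seq_len = x.shape[0]
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(k.shape[-1])
        # causal mask: position i may only attend to positions <= i
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
        scores[mask] = -1e9
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    rng = np.random.default_rng(0)
    d_model, d_head, seq_len = 16, 8, 5
    x = rng.normal(size=(seq_len, d_model))
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(causal_self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)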

Jul 27, 2020

>> But GPT-3 is much more successful, including at giving correct answers to arithmetic problems that weren't in its training set.

That's not exactly what the GPT-3 paper [1] claims. The paper claims that a search of the training dataset for instances of, very specifically, three-digit addition returned no matches. That doesn't mean there weren't any instances; it only means the search didn't find any. It also doesn't say anything about the existence of instances of other arithmetic operations in GPT-3's training set (and the absence of "spot checks" for such instances of other operations suggests they were, actually, found, but not reported, in the time-honoured fashion of not reporting negative results). So at best we can conclude that GPT-3 gave correct answers to three-digit addition problems that weren't in its training set, and even then, only for the 2,000 or so problems that were specifically searched for.
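
For context, the "spot check" being discussed is essentially a string search over the training text. My own rough approximation of what such a search might look like (the paper doesn't publish its exact procedure) shows how easily differently-formatted instances can slip past it:

    import re

    def contains_addition_instance(corpus_text, a, b):
        # Search for a few literal surface forms of the problem; anything phrased
        # differently (tables, "the sum of ...", digits split by markup) is missed.
        total = a + b
        patterns = [
            rf"\b{a}\s*\+\s*{b}\s*=\s*{total}\b",
            rf"\b{a} plus {b} (is|equals) {total}\b",
        ]
        return any(re.search(p, corpus_text) for p in patterns)

    print(contains_addition_instance("we know that 123 + 456 = 579 here", 123, 456))  # True
    print(contains_addition_instance("adding 123 to 456 gives 579", 123, 456))        # False: missed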

In general, the paper tested GPT-3's arithmetic abilities with addition and subtraction between one to five digit numbers and multiplication between two-digit numbers. They also tested a composite task of one-digit expressions, e.g. "6+(4*8)" etc. No division was attempted at all (or no results were reported).

Of the attempted tasks, all but addition and subtraction of one- to three-digit numbers had accuracy below 20%.

In other words, the only tasks that were at all successful were exactly those tasks that were the most likely to be found in a corpus of text, rather than a corpus of arithmetic expressions. The results indicate that GPT-3 cannot "perform arithmetic" despite the paper's claims to the contrary. They are precisely the results one should expect to see if GPT-3 was simply memorising examples of arithmetic in its training corpus.

>> So what changed? We aren't sure, but the speculation is that in the process of training, GPT-3 found that the best strategy to correctly predicting the continuation of arithmetic expressions was to figure out the rules of basic arithmetic and encode them in some portion of its neural network, then apply them whenever the prompt suggested to do so.

There is no reason why a language model should be able to "figure out the rules of basic arithmetic" so this "speculation" is tantamount to invoking magick.

Additionally, language models and neural networks in general are not capable of representing the rules of arithmetic because they are incapable of representing recursion and universally quantified variables, both of which are necessary to express the rules of arithmetic.

In any case, if GPT-3 had "figure(d) out the rules of basic arithmetic", why stop at addition, subtraction and multiplication of one- to five-digit numbers? Why was it not able to use those learned rules to perform the same operations with more digits? Why was it not capable of performing division (i.e. the inverse of multiplication)? A very simple answer is: GPT-3 did not learn the rules of arithmetic.

_________

[1] https://arxiv.org/abs/2005.14165

Jul 27, 2020

You don't have to wonder. In their paper (https://arxiv.org/abs/2005.14165) they state it has 0.7% accuracy on zero-shot five-digit addition problems and 9.3% accuracy on few-shot five-digit addition problems.

Jul 20, 2020

Section 3.3 of the paper [0] covers this. 93% of the training data was English, and the rest was other languages (German, French, and a long tail of others). This was not training data specifically for translation, just the natural mix of languages as it appears in some documents.

With a few prompts to explain the translation task, GPT-3 is claimed to perform well on certain translation tasks to English. (It was not as good as the state of the art in the other direction.)

[0] https://arxiv.org/abs/2005.14165
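
A minimal sketch of what such a few-shot translation prompt might look like (my own wording, not the paper's exact format):

    # A handful of demonstration pairs, then the sentence to translate.
    # Into-English is the direction the paper reports as strongest.
    prompt = (
        "French: Bonjour, comment allez-vous ?\n"
        "English: Hello, how are you?\n"
        "French: Le chat dort sur le canapé.\n"
        "English: The cat is sleeping on the sofa.\n"
        "French: Il pleut depuis ce matin.\n"
        "English:"
    )
    # expected continuation: something like "It has been raining since this morning."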

Jul 20, 2020

Oh and note that this:

>> GPT-3 can also finally do arithmetic, something GPT-2 was unable to do well.

Is a preposterous claim that is very poorly supported by the data in the GPT-3 paper [1]. Figure 3.10 in the paper summarises the results. The authors tested addition of two- to five-digit numbers, subtraction of two- to five-digit numbers, and multiplication of two-digit numbers drawn uniformly at random from [0,100]. There was also a composite task of addition, subtraction and multiplication with single-digit numbers (e.g. 6+(4*8), etc.).

On all tasks other than two- and three-digit addition and subtraction, accuracy was uniformly under 20%. Those four tasks did achieve high accuracy with more parameters.

Of course, this doesn't show that the larger models "learned arithmetic". Two- and three-digit addition and subtraction are likely to be much better represented in a natural language dataset than the other operations (and note, of course, the conspicuous absence of division). So it's safe to assume that the model has seen the operations it's asked to repeat and knows their results by heart. Remember that for two- and three-digit addition and subtraction one only needs a dataset covering the numbers up to 999, which is really tiny and easy to memorise.

Edit: the authors note that they "spot checked" whether the model is simply memorising results by searching for three-digit addition examples in their dataset. Out of 2,000 three-digit addition problems, they found matches for fewer than 17% in their dataset, which "suggests" the model had not seen the problems before. Or it "suggests" the search was not capable of finding many more existing matches. In any case, why only "spot check" three-digit addition? Who knows; the paper doesn't say. Certainly, one- and two-digit addition and subtraction should be much more common in a natural language dataset. The authors also say that the model often makes mistakes such as not carrying a one, so it must actually be performing arithmetic! Or it's simply reproducing common arithmetic mistakes from its dataset. Overall, this sort of "testing" of arithmetic prowess simply doesn't cut the mustard.

Edit 2: Also, there is no information about how many arithmetic problems of each type were tried. One? Ten? One hundred? Were all arithmetic tasks tested with the same number of problems? Unknown.

_____________

[1] https://arxiv.org/abs/2005.14165

Jul 18, 2020

I think that it’s the opposite. This algorithm requires many examples of text on the specific topic. Probably more than most humans would require.

> While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples [0]

I don’t know what constitutes an example in this case, but let’s assume it means one blog article. I don’t know many humans who have read thousands or tens of thousands of blog articles on a specific topic. And if I did, I’d expect that human to write a much more interesting article.

To me, this and other similar generated texts from OpenAI feel bland / generic.

Take a listen to the generated music from OpenAI - https://openai.com/blog/jukebox/. It’s pretty bad, but in a weird way. It’s technically correct - in key, on beat, etc. Even some of the music it generates is technically hard to play, but it sounds so painfully generic.

> All the impressive achievements of deep learning amount to just curve fitting. (Judea Pearl [1])

This comment was written by a human :)

[0] https://arxiv.org/abs/2005.14165

[1] https://www.quantamagazine.org/to-build-truly-intelligent-ma...

Jun 14, 2020

This is changing. In natural language processing, just in the past couple of weeks, OpenAI wrote about their GPT-3 model, which can learn some tasks remarkably quickly, after only one or two examples. That model has extreme compute requirements, but it shows strong progress on performing never-before-seen tasks. There is still some steam left in the current deep learning boom. https://arxiv.org/abs/2005.14165

May 28, 2020

paper: https://arxiv.org/abs/2005.14165

abstract:

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.