Associative memories

AI researchers in the 80s ran into a problem: the more their systems knew, the slower they ran.  People, by contrast, tend to get faster (and better in other ways) at whatever it is they’re doing the more they learn.

The solution, of course, is: Duh.  The brain doesn’t work like a von Neumann model with an active processor and passive memory.  It has, in a simplified sense, a processor per fact, one per memory.  If I hold up an object and ask you what it is, you don’t calculate some canonicalization of it as a key into an indexed database.  You compare it simultaneously to everything you’ve ever seen (and still remember).  Oh, yeah, that’s that potted aspidistra that Aunt Suzie keeps in her front hallway, with the burn mark from the time she …
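
To make the contrast concrete, here is a minimal editor’s sketch (the feature vectors and object names are invented for illustration, not anything from the post) of keyed retrieval versus brute-force associative matching:

    # Indexed lookup vs. associative (best-match) recall, in miniature.
    # Percepts and memories are tiny feature vectors standing in for real sensory data.

    memories = {
        "aspidistra": (0.9, 0.1, 0.3),
        "umbrella":   (0.2, 0.8, 0.5),
        "teapot":     (0.4, 0.4, 0.9),
    }

    def indexed_lookup(key):
        # von Neumann style: you must already have canonicalized the percept into a key.
        return memories[key]

    def associative_recall(percept):
        # Brain style, conceptually: compare the percept against *every* stored memory.
        # Each comparison is independent, so in principle they could all run in parallel;
        # here they simply run in a loop.
        def similarity(a, b):
            return -sum((x - y) ** 2 for x, y in zip(a, b))
        return max(memories, key=lambda name: similarity(memories[name], percept))

    print(associative_recall((0.85, 0.15, 0.35)))   # -> aspidistra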

The processing power necessary to do that kind of parallel matching is high, but not higher than the kind of processing power that we already know the brain has.  It’s also not higher than the processing power we expect to be able to throw at the problem by 2020 or so.  Suppose it takes a million ops to compare a sensed object to a memory.  That’s 10 MIPS per memory to do the comparison in a tenth of a second.  A modern workstation with 10 gigaops could handle 1000 concepts.  A GPGPU with a teraops could handle 100K, which is still probably in the hypohuman range.  By 2020, a similarly priced GPGPU could do 10M concepts, which is right smack in the human range by my best estimate.
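
Those figures follow directly from the stated assumptions (a million ops per comparison, a tenth-of-a-second budget); a quick sketch of the arithmetic:

    # How many memories can each get 10^6 comparison ops within a 0.1 s recognition budget?
    OPS_PER_COMPARISON = 1e6   # assumed cost of matching one percept against one memory
    RECOGNITION_TIME_S = 0.1   # assumed time budget for recognition

    def concepts_supported(ops_per_second):
        return ops_per_second * RECOGNITION_TIME_S / OPS_PER_COMPARISON

    for label, ops in [("10 gigaops workstation", 1e10),
                       ("1 teraops GPGPU",        1e12),
                       ("projected 2020 GPGPU",   1e14)]:
        print(f"{label}: ~{concepts_supported(ops):.0e} concepts")
    # -> 1e+03, 1e+05, and 1e+07 concepts respectively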

Associative memory gets you a lot.  You don’t have to parse an unknown object for algorithmic retrieval.  You don’t have to come up with some one-size-fits-all representation and/or classification scheme.  Indeed, each object in memory can have its own representation if necessary or useful.

It gets better.  The memories aren’t all, or even mostly, objects.  They’re typically actions.  Let’s suppose the actions are represented as situation-action-resulting situation triples — something like Minsky’s trans-frames.  Then we can use the associative memory to (see the sketch after this list):

  • recognize, as described above
  • predict: search on the situation and action; the prediction is the result in the best match
  • plan: match on situation and desired result; do the action from the best match
  • generalize: every time a was done, b happened
  • model: by chaining predictions, etc
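
As a rough illustration of how one best-match primitive can serve several of these roles (an editor’s sketch: the vector representation and distance function are invented for the example, not Minsky’s actual trans-frame machinery):

    # A toy associative memory over (situation, action, resulting situation) triples.
    # Situations are feature vectors; "matching" is brute-force nearest neighbor,
    # standing in for the massively parallel comparison described above.
    from typing import List, Optional, Tuple

    Vector = Tuple[float, ...]
    Triple = Tuple[Vector, str, Vector]   # (situation, action, resulting situation)

    def distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    class TripleMemory:
        def __init__(self) -> None:
            self.triples: List[Triple] = []

        def remember(self, situation: Vector, action: str, result: Vector) -> None:
            self.triples.append((situation, action, result))

        def predict(self, situation: Vector, action: str) -> Optional[Vector]:
            # Match on (situation, action); the prediction is the result of the best match.
            candidates = [t for t in self.triples if t[1] == action]
            if not candidates:
                return None
            return min(candidates, key=lambda t: distance(t[0], situation))[2]

        def plan(self, situation: Vector, goal: Vector) -> Optional[str]:
            # Match on (situation, desired result); the plan is the action of the best match.
            if not self.triples:
                return None
            return min(self.triples,
                       key=lambda t: distance(t[0], situation) + distance(t[2], goal))[1]

    mem = TripleMemory()
    mem.remember((0.0, 0.0), "push", (1.0, 0.0))
    mem.remember((1.0, 0.0), "lift", (1.0, 1.0))
    print(mem.predict((0.1, 0.0), "push"))   # -> (1.0, 0.0)
    print(mem.plan((1.0, 0.1), (1.0, 1.0)))  # -> lift

Recognition is the same best-match step applied to the situation alone, and chaining predict calls gives a crude forward model; generalization would sit on top of the same primitive.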

There was an attempt to do this kind of thing in mainstream AI under the name “case-based reasoning” a couple of decades ago, but it appears to have foundered for several reasons, not least of which was the inability to do heavy-duty parallel matching on extensive memory sets.

There are a number of things that need to be added to the scheme for it to be useful and robust, like embedding it in a hierarchical, multiagent architecture, the ability to do analogical quadrature, and the ability to find useful representations.  But that’s for another post.

12 Responses to “Associative memories”

  1. Fred Hapgood Says:

    There is something weird about the argument that we have to wait until 2020 to test the hypothesis that the main obstacle to getting to AI is that we don’t have enough cycles. Tiny little animals with tiny little brains do way better with pattern processing than mainframes that probably — I haven’t run the numbers — have more than enough cycles for equivalence. Machines with the pattern recognition skills of bats and rats and ravens would have an immense market, so I don’t think the problem is lack of investment. And I suspect that we are a lot further along with reverse engineering the brains of small vertebrates than we are those of humans.

    Before I buy the idea that cycles are the royal road to human-level AI I would like to see the idea validated with pigeon-level AI. That would be plenty dramatic enough for a few years. To sum: I suspect that our time would be far better spent working on the problem of rat-level AI than the much harder problem of human-level AI. Besides, we are almost certainly going to find that solving the first problem is critical to solving the second.

  2. TheRadicalModerate Says:

    You’re never going to get the right scaling (and hence a decent prediction of when various things will be possible) if you just deal with the problem as “more MIPS=higher capability”. The problem isn’t processing power; it’s memory bandwidth.

    Modern computing systems manage to achieve improved performance by clever, multi-level caching schemes that can take advantage of physical and temporal locality of reference. As long as you can whack on data sets that can be organized to be adjacent to one another, and as long as you can whack on them at the same time before moving on to other data sets, you can play all kinds of games with your memory architecture and achieve near-linear computational scaling as MIPS increase.

    But the kinds of associative memory you’re talking about have lousy physical/temporal locality. Instead, you’ve got meshes of AMems whose extremely large activation vectors (10^4 to 10^6 bits) produce other extremely large output vectors, which in turn cascade into other AMems, and so on, until the whole system relaxes into something resembling a stable state.

    You can certainly engineer AIs without neural-like mechanisms, but the amount of information that needs to be processed per neural “cycle” (~100-300 ms for cognitive functions, much less than that for motor control and some perceptual tasks; let’s say 100 ms on average) is still going to be on the order of 10^12 to 10^14 “synapses”. That’s 10^12 to 10^14 synapses * ~4 memory fetches per synapse * 10 cycles per second, all effectively uncached, or 4E13 to 4E15 memory fetches per second or, assuming an 8-bit data precision, 3E14 to 3E16 bps. That’s still 4 or 5 orders of magnitude beyond the peak transfer rate of today’s state-of-the-art memory systems. That’s not going to be a problem for CPU design, but I’m a lot less sanguine that we’re going to get there with current memory and bus architectures, Moore’s Law or no.
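
    A quick numeric sketch of that estimate (editor’s check; the synapse range, fetch count, and cycle rate are the figures above, while the DDR3 comparison point is an added assumption):

        # Memory traffic implied by the estimate above.
        for label, synapses in (("low", 1e12), ("high", 1e14)):
            fetches = synapses * 4 * 10   # ~4 fetches per synapse, 10 cycles per second
            bits = fetches * 8            # 8-bit precision per fetch
            print(f"{label}: {fetches:.0e} fetches/s, {bits:.1e} bits/s")
        # -> low: 4e+13 fetches/s, 3.2e+14 bits/s
        # -> high: 4e+15 fetches/s, 3.2e+16 bits/s
        # For scale, a single circa-2009 DDR3 channel peaks around 1e11 bits/s (assumed),
        # i.e. roughly 3.5 to 5.5 orders of magnitude short of this demand.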

    The breakthrough that’s needed is one of going from extremely narrow bus architectures, which today have ~1000 connections to less than 100 conductive pathways, to massively connected systems with 10^3 or 10^4 conductive pathways and 10^7 or 10^8 connections. That in turn implies architectures with at least 10^2 or 10^3 processing elements, each with enough working memory to be useful.

    This is something that we fundamentally don’t know how to do yet. It’s an extremely difficult manufacturability and reliability hoop to jump through. This is the gating problem for producing a meso-human AI.

  3. James Gentile Says:

    Funny enough, I disagree again. You mention GPUs of 2020, but neglect to mention (again) that supercomputers will be at that level in the next couple of years. Also, again, I think less-than-human AI (rats, birds, spiders, etc.) is utterly useless, but that’s just my opinion.

  4. miron Says:

    TheRadicalModerate – FPGAs and Cell processors are examples of how to integrate RAM with CPUs, but we don’t do it for personal computing yet. Intersperse CPUs with RAM on the silicon die and you should get all the memory bandwidth you need.

    I believe the average axon length is 10mm. That reaches an area that is less than 1e-4 of the total cortex surface area (~1.6 m^2). That’s pretty far from needing all that memory bandwidth globally.

    So yes, I do agree that memory bandwidth will be the bottleneck rather than CPU, but I think we can easily build adequate solutions.

  5. TheRadicalModerate Says:

    @miron–

    The problem isn’t the memory itself; it’s the wires. Wires are the bane of the electronics industry. They’re expensive, hard to fabricate, hard to test, and they break easily. You can have all the FPGAs you want, but you can only present vectors to them serially, and even then, the gate array’s logic is bounded by how many simultaneous inputs it has.

    My point was merely that the usual way of moving data around in a computer breaks down under the amount of parallel data needed for proper hierarchical pattern recognition. You can’t trade the number of wires you need for properly parallel data processing simply by emulating the parallelism with really fast serialized buses. It just doesn’t scale.

  6. Nicole Tedesco Says:

    I agree with Fred — we will need to learn to crawl before we can learn to run. In terms of the bandwidth-versus-cycles debate, we don’t necessarily need the bandwidth, or the cycles, in order to prove any specific concept as long as we don’t mind the computation happening slowly. Of course the problem is that some things, like factoring very large numbers, currently happen too slowly to complete in our lifetimes.

    Though the number of synaptic connections in our nervous system may be extremely difficult to implement in our models, we may not actually need them all to improve on the pattern recognition and machine learning capabilities we have today. What are all those connections used for, anyway? Do all of them participate in information transmission? How much of the biological neural circuitry is involved in phase synchronization? How much of it is simply redundant, in order to make up for the noisy environment of a “natural” brain? Understand that modern computers use as much heat as they do because they are designed to be as error-free as possible. Removing sources of computational error, which modern computers are really good at, may go a long way toward reducing the number of neural components that need to be implemented in our neural models.

  7. Nicole Tedesco Says:

    Oh, I forgot to mention: phase-synchronized computation is something that modern computers are really good at by default. So consider how much of our lack of cycles and bandwidth is made up for by built-in phase-synchronization capabilities and by the avoidance of error (which then need not be corrected for).

  8. michael edelman Says:

    It’s not enough just to have a neural network; you have to know how it works, and the brain doesn’t work like most high-level neural net systems AI people are familiar with. It’s a massively parallel pattern matcher. Connections? Yes, we estimate up to 10^14 synapses, but that still doesn’t tell the whole story. There are neurotransmitters, like nitric oxide, that don’t act on a single synapse but diffuse through an area. There are probably other systems we haven’t discovered yet.

    The idea of “fetches” doesn’t seem to shed any light here. The “program” of the brain, if we may call it that, can’t be written out as a single entity. The brain is made up of a number of nuclei, each of which is a very generalized pattern-matching system, and these work together to model the outside world, or so it appears. Beyond that, we have very little idea of how this organizes itself into a conscious being, although there are plenty of philosophers like Dennett who will tell you it’s a simple matter ;-)

    Looking at small scale neural nets in vivo has given us some insights into how to design mechanical insects that can navigate on their own, but we’re still far from a workable model of how the brain organizes memory and knowledge of events and objects, let alone a sense of self.

  9. Pink Pig Says:

    The relationship between random-access and associative memories is not linear. (Why would anyone suppose that it is?) The time required for an associative memory to function correctly grows approximately as the logarithm of the time required by a random-access memory. Computing power is giga these days, not mega. The real reason that software hasn’t sped up at the same rate as hardware is twofold: 1) the software side simply doesn’t have a clue how to make software run faster; 2) the software side is lazy enough to use improvements in hardware to mask its own shortcomings. (And if that’s not enough for you, the study of Artificial Intelligence is an attempt to create something that we already have several billion of: human brains.)

    I’ve been doing software and hardware for almost 50 years now, and I know whereof I speak.

    When PCs (and similar microprocessor-based equipment) had memories of 8K or less, the operating systems fit neatly into less than 4K. Try finding an operating system that runs today in under 4K.

    The computer that I learned on had 1000 words of memory (it was a big UNIVAC I). It didn’t really have an operating system per se — what was available was part of the bootstrap, a standard practice back then. A year later, I worked with a computer (NCR200) that had all of 200 bytes of memory. The first real operating system that I saw (and worked with) was the executive developed for the Honeywell 800/1800, which was a true multiprocessor (in the late 50s/early 60s!). It ran in under 2K words (IIRC, a word on the Honeywell 800 was 48 bits wide). We also didn’t have the luxury of a standardized USB — that meant dealing with paper tape, punched cards, magnetic tape, and numerous other peripherals.

  10. miron Says:

    @TheRadicalModerate – To simulate the brain’s connectivity, you don’t need the bandwidth to be on a global bus. You need bandwidth that is inversely proportional to the distance between two points (i.e. power law).

    FPGAs, Cell processors and other interspersed memory/computation architectures have high local bandwidth (within the chip). The highest bandwidth would be required only over the average reach of an axon, which covers only 1/10,000 of the brain area; i.e., you only need to reach 1/10,000 of the other chips with the highest bandwidth.

    To achieve power law distribution for bandwidth, it would be enough to lay out the chips in a 2-D network, and to have local buses between adjacent chips. In effect, this mimics the connectivity of the brain.
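
    A toy way to see why mostly-local links can suffice (editor’s sketch; the 1/d^2 fall-off is an assumption standing in for the power law mentioned above, not a measured property of cortex):

        # If the chance that two units talk to each other falls off as 1/d^2 with
        # distance d (in chip-to-chip hops), most required bandwidth stays local,
        # so nearest-neighbor buses on a 2-D grid carry almost all of the traffic.
        MAX_DIST = 1000
        weights = [1.0 / d ** 2 for d in range(1, MAX_DIST + 1)]
        total = sum(weights)

        for hops in (1, 4, 16, 64):
            frac = sum(weights[:hops]) / total
            print(f"connections within {hops:>2} hops: {frac:.0%}")
        # -> roughly 61%, 87%, 96%, and 99% respectively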

  11. TheRadicalModerate Says:

    @miron–

    Agree completely that connection density probably follows a power law. But you’re still wire-limited at the chip, printed circuit board, and bus levels. As the number of wires goes up, your IC yields go down. Same story with your PCBs. And your backplane is just a nightmare.

    So you may be able to lop a couple of orders of magnitude off the problem due to locality, but that still leaves 2-3 orders of magnitude to go, with a lot of it at the longer connection distances, where bandwidth necessarily drops due to noise, capacitance, etc.

    @Michael–

    We certainly don’t know enough today to build AIs out of hierarchical neural pattern matchers, but I suspect that that will change rather quickly when we can simulate neural nets that are big enough to do useful work. And that’s sort of a chicken-and-egg problem, isn’t it? No platform to do interesting neural AI on because there’s no software, no software because there’s no platform. Bottom line is that very little is going to happen until we get some kind of a fab breakthrough–which was more-or-less my original point.

    I also agree that we’re unlikely to get genuine neural processing at the neuron level: way too many processing units. However, there’s a lot of territory in between one processor per neuron and all neurons simulated on a single processor. We know enough about neural activation functions to write interesting algorithms. Those algorithms can be stepped millions of times faster than a real neuron fires, so millions of neuron simulations per processor are possible. You are also correct that any accurate neural simulation must also simulate the funny chemicals being squirted out globally, but that’s a pretty minor complicating factor to discovering interesting algorithms.

    The most interesting problem to be solved is to figure out how pattern-matching hierarchies self-organize to achieve cognition. There’s a genetic component involved here: Lots of white matter projections grow during development, both from the limbic system and brain stem to the cortex, as well as from place to place inside the cortex itself. But once this rough architecture is in place, plasticity has to prune connections and learn appropriate connection weights. Even more important, something’s going on that must automatically allocate the size and functions of the network of pattern recognizers. But that algorithm has to be really, really simple. Its discovery will be a key enabler for using neural nets for more cognitive tasks.

  12. miron Says:

    @TheRadicalModerate – There are interconnects available at around 1e10 to 1e11 bytes per second. This includes AMD HyperTransport and AMD Horus. With 10,000 chips per system and 20,000 buses you would be at about the right bandwidth.

    A cluster of off the shelf boxes has about 160 chips with 320 HyperTransport buses. So we’re about a factor of 60 away from having the right bandwidth in one rack.

    You’d definitely need a custom solution that is communication-bandwidth oriented to get the right bandwidth per chip. My guess is that the bandwidth/chip gap will be closed by the time the memory/chip gap is (10-15 years).

    I don’t understand your comment about bandwidth dropping at longer distances. First of all, you don’t need that much, due to locality. Second, you can just route through local interconnects by using multiple hops. There’s plenty of time in a “neuronal cycle” (10ms?) to cross even 100 hops. Basically, you don’t need anything more than connections to immediate neighbors.
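
    A quick check of the hop-count timing (editor’s sketch; the 1 microsecond per-hop figure is an assumed, fairly generous latency for link plus forwarding, not a number from the thread):

        # Does routing through ~100 nearest-neighbor hops fit inside a ~10 ms "neuronal cycle"?
        HOP_LATENCY_S  = 1e-6    # assumed per-hop latency (link + forwarding)
        NEURAL_CYCLE_S = 10e-3   # the ballpark cycle time suggested above
        hops = 100

        route_time = hops * HOP_LATENCY_S
        print(f"{hops}-hop route: {route_time * 1e3:.1f} ms, "
              f"{route_time / NEURAL_CYCLE_S:.0%} of one cycle")
        # -> 100-hop route: 0.1 ms, 1% of one cycle
        # Latency is not the constraint at these timescales; per-link bandwidth is.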
