
Okay, here's my attempt!

First, we take a sequence of words and represent it as a grid of numbers: each column of the grid is a separate word, and each row of the grid is a measurement of some property of that word. Words with similar meanings are likely to have similar numerical values on a row-by-row basis.

(During the training process, we create a dictionary of all possible words, with a column of numbers for each of those words. More on this later!)

This grid is called the "context". Typical systems will have a context that spans several thousand columns and several thousand rows. Right now, context length (column count) is rapidly expanding (1k to 2k to 8k to 32k to 100k+!!) while the dimensionality of each word in the dictionary (row count) is pretty static at around 4k to 8k...
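
To make the grid concrete, here is a tiny numpy sketch with made-up sizes (real systems use tens of thousands of dictionary words and roughly 4k-8k rows per word); the names vocab and embedding are just illustrative:

    import numpy as np

    # Toy dictionary: 5 words, each represented by a column of 4 numbers.
    vocab = ["the", "cat", "sat", "on", "mat"]
    embedding = np.random.randn(4, len(vocab))   # rows = properties, columns = words

    # The "context" grid for the sequence "the cat sat":
    token_ids = [vocab.index(w) for w in ["the", "cat", "sat"]]
    context = embedding[:, token_ids]            # shape (4, 3): one column per word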

Anyhow, the Transformer architecture takes that grid and passes it through a multi-layer transformation algorithm. The functionality of each layer is identical: receive the grid of numbers as input, then perform a mathematical transformation on the grid of numbers, and pass it along to the next layer.

Most systems these days have around 64 or 96 layers.

After the grid of numbers has passed through all the layers, we can use it to generate a new column of numbers that predicts the properties of some word that would maximize the coherence of the sequence if we add it to the end of the grid. We take that new column of numbers and comb through our dictionary to find the actual word that most-closely matches the properties we're looking for.

That word is the winner! We add it to the sequence as a new column, remove the first column, and run the whole process again! That's how we generate long text-completions one word at a time :D
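
Continuing the toy numpy sketch from above, the whole inference loop might look roughly like this; the layer objects are hypothetical stand-ins for the real transformation stack, not an actual API:

    def generate(context, layers, embedding, vocab, n_words):
        words = []
        for _ in range(n_words):
            grid = context
            for layer in layers:                  # pass the grid through every layer
                grid = layer(grid)
            predicted = grid[:, -1]               # column predicting the next word's properties
            scores = embedding.T @ predicted      # compare against every word in the dictionary
            best = int(np.argmax(scores))         # the closest match wins
            words.append(vocab[best])
            new_col = embedding[:, [best]]
            context = np.hstack([context[:, 1:], new_col])  # drop first column, append the winner
        return words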

So the interesting bits are located within that stack of layers. This is why it's called "deep learning".

The mathematical transformation in each layer is called "self-attention", and it involves a lot of matrix multiplications and dot-product calculations with a learned set of "Query, Key and Value" matrixes.
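
A minimal sketch of that calculation, assuming a single attention head and ignoring the usual multi-head and feed-forward machinery, following the columns-are-words convention above:

    def self_attention(X, Wq, Wk, Wv):
        # X is the grid: one column per word.
        Q = Wq @ X                                     # queries
        K = Wk @ X                                     # keys
        V = Wv @ X                                     # values
        scores = Q.T @ K / np.sqrt(K.shape[0])         # dot product between every pair of words
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # softmax: how much each word attends to each other word
        return V @ weights.T                           # each output column mixes the values of all words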

It can be hard to understand what these layers are doing linguistically, but we can use image-processing and computer-vision as a good metaphor, since images are also grids of numbers, and we've all seen how photo-filters can transform that entire grid in lots of useful ways...

You can think of each layer in the transformer as being like a "mask" or "filter" that selects various interesting features from the grid, and then tweaks the image with respect to those masks and filters.

In image processing, you might apply a color-channel mask (chroma key) to select all the green pixels in the background, so that you can erase the background and replace it with other footage. Or you might apply a "gaussian blur" that mixes each pixel with its nearest neighbors, to create a blurring effect. Or you might do the inverse of a gaussian blur, to create a "sharpening" operation that helps you find edges...

But the basic idea is that you have a library of operations that you can apply to a grid of pixels, in order to transform the image (or part of the image) for a desired effect. And you can stack these transforms to create arbitrarily-complex effects.
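
For instance, two classic 3x3 kernels and a naive way of applying them, just to ground the analogy (nothing transformer-specific here):

    blur = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0   # mixes each pixel with its neighbors
    sharpen = np.array([[ 0, -1,  0],
                        [-1,  5, -1],
                        [ 0, -1,  0]])    # roughly the inverse idea: emphasize edges

    def apply_kernel(image, kernel):
        h, w = image.shape
        out = np.zeros_like(image, dtype=float)
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                out[i, j] = np.sum(image[i-1:i+2, j-1:j+2] * kernel)
        return out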

The same thing is true in a linguistic transformer, where a text sequence is modeled as a matrix.

The language-model has a library of "Query, Key and Value" matrixes (which were learned during training) that are roughly analogous to the "Masks and Filters" we use on images.

Each layer in the Transformer architecture attempts to identify some features of the incoming linguistic data, and then, having identified those features, it can subtract those features from the matrix, so that the next layer sees only the transformation, rather than the original.

We don't know exactly what each of these layers is doing in a linguistic model, but we can imagine it's probably doing things like: performing part-of-speech identification (in this context, is the word "ring" a noun or a verb?), reference resolution (who does the word "he" refer to in this sentence?), etc, etc.

And the "dot-product" calculations in each attention layer are there to make each word "entangled" with its neighbors, so that we can discover all the ways that each word is connected to all the other words in its context.

So... that's how we generate word-predictions (aka "inference") at runtime!

But why does it work?

To understand why it's so effective, you have to understand a bit about the training process.

The flow of data during inference always flows in the same direction. It's called a "feed-forward" network.

But during training, there's another step called "back-propagation".

For each document in our training corpus, we go through all the steps I described above, passing each word into our feed-forward neural network and making word-predictions. We start out with a completely randomized set of QKV matrixes, so the results are often really bad!

During training, when we make a prediction, we KNOW what word is supposed to come next. And we have a numerical representation of each word (4096 numbers in a column!) so we can measure the error between our predictions and the actual next word. Those "error" measurements are also represented as columns of 4096 numbers (because we measure the error in every dimension).

So we take that error vector and pass it backward through the whole system! Each layer needs to take the back-propagated error matrix and perform tiny adjustments to its Query, Key, and Value matrixes. Having compensated for those errors, it reverses its calculations based on the new QKV, and passes the resultant matrix backward to the previous layer. So we make tiny corrections on all 96 layers, and eventually to the word-vectors in the dictionary itself!
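
As a drastically simplified sketch of one such adjustment, pretend the whole stack is a single linear layer W and the loss is squared error; the real thing back-propagates through every layer's QKV matrixes, but the shape of the update is the same idea:

    def training_step(W, x, target, lr=1e-3):
        pred = W @ x                  # forward pass: predicted column for the next word
        error = pred - target         # error measured in every one of the 4096 dimensions
        grad_W = np.outer(error, x)   # gradient of 0.5 * ||error||^2 with respect to W
        return W - lr * grad_W        # tiny adjustment that reduces the error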

Like I said earlier, we don't know exactly what those layers are doing. But we know that they're performing a hierarchical decomposition of concepts.

Hope that helps!


The short answer is that when doing quantum mechanics on curved spacetime, we tend to do it semiclassically: the quantum stuff is approximated or averaged, and then that is taken as the source of classical spacetime curvature. The averaged (or whatever) value of the Higgs field at a point in spacetime, like the value of all the other fields at the same point, is then just a contribution to the stress-energy tensor at that point in spacetime.

The Einstein Field Equations [1], dropping constants and indices, and with a vanishing cosmological constant, can be written as G = T. Here G is the Einstein tensor, which describes curvature at each point, and T is the stress-energy tensor, which describes the flux of momentum-energy through each point.
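
For reference, with the indices and constants restored (and the cosmological constant \Lambda kept), the equations read:

    G_{\mu\nu} + \Lambda g_{\mu\nu} = (8 \pi G / c^4) T_{\mu\nu}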

Below, for ease of understanding T, let's apply Cartesian-like coordinates (really, Minkowski coordinates: spatial Cartesian coordinates x, y, z and a time coordinate t, where the time coordinate enters the distance between two points with the opposite sign and a factor of c). One way of writing out the line element in a Minkowskian fashion is ds = sqrt(dx^2 + dy^2 + dz^2 - c^2dt^2), but below let's use ds^2 = c^2dt^2 - dx^2 - dy^2 - dz^2, which is the "mostly-minus" or "+,-,-,-" metric signature; below we'll refer to this line element as "the metric". The metric is a component of the Einstein tensor G, and typically in the Einstein Field Equations you can see the metric exploded out as its own tensor g [2].
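
In tensor form, taking x^0 = ct, that mostly-minus metric is just the constant diagonal matrix

    \eta_{\mu\nu} = \mathrm{diag}(+1, -1, -1, -1), \qquad ds^2 = \eta_{\mu\nu} \, dx^\mu dx^\nu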

Let's look shallowly at the stress-energy tensor. We have made a deliberate choice of metric, and of coordinate basis, and will lean a bit on the natural spacetime slicing into space and time that these particular choices give us.

T can be written as a 4 x 4 matrix with rows and columns starting at 0 and running to 3. T_{00} or shorter T_00, two zeroes subscripting the letter T, means the 0th row and 0th column; 0 in our metric above means the time direction; 1 means the x direction; 2 means the y direction; 3 means the z direction.

With our set of choices, T_00 corresponds to \gamma m_0 c^2: it is the "matter" at a point in spacetime that has come from the past and is going to the future, and is not moving in the x, y, or z direction. "Matter" is all the contributions to energy-momentum. Here that's some expectation value for each of the quantum fields at the point.

T_ii means we look at T_00, T_11, T_22, T_33. Let's look at the lower three of these diagonals, T_ii with i != 0, so we are not looking at T_00. With all these choices made, T_11 is the flux of x momentum in the x direction. If x is "left" and "right", then we are thinking about momentum going from left to right, entering the point from the left and exiting the point to the right. Again, that momentum can be photons or any other quantum field content.

T_0i and T_i0 can be thought of as momentum-energy which originates or terminates at the point, arriving or departing in the i direction; more precisely, if you know some special relativity, T_00 here looks like \gamma m_0 c^2 and T_0i looks like \gamma m_0 \vec{v}_{i} c.

For completeness, the off-diagonals, T_ij with i,j != 0 and i != j, are the fluxes of i momentum in the j direction, or equivalently, with our choices, the total momentum times the velocity in the j direction.
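
Laid out as the 4 x 4 matrix described above, and using the usual names for the components under these choices, T looks like:

    T_{\mu\nu} = \begin{pmatrix}
      \text{energy density}     & \text{energy flux}_x & \text{energy flux}_y & \text{energy flux}_z \\
      \text{momentum density}_x & \text{pressure}_{xx} & \text{shear}_{xy}    & \text{shear}_{xz} \\
      \text{momentum density}_y & \text{shear}_{yx}    & \text{pressure}_{yy} & \text{shear}_{yz} \\
      \text{momentum density}_z & \text{shear}_{zx}    & \text{shear}_{zy}    & \text{pressure}_{zz}
    \end{pmatrix}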

So, considering the middle of three 3d spaces at time coordinates t-1, t, and t+1, at a point p the stress-energy tensor T encodes all the quantum field values, and changes thereof, that contribute to the energy-momentum that is constant at p (in T_00) or which arrives at and departs from p in the three spacelike directions. The momentum is deposited into the "matter" at the point and at adjacent points (in the future light cone) of the spacetime we are looking at, and arrives from the matter at the adjacent points in the past light cone. [4]

When taking this sort of semiclassical approach we usually write G = <T> where the angle brackets mean the expectation value for what they enclose. <T> is the source for the curvature in G, and thus the curvature does not depend on the details of what number of electron neutrinos vs photons are at the point.
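
Written out, the semiclassical equation is just the classical one with the expectation value of the (quantum) stress-energy operator on the right:

    G_{\mu\nu} = (8 \pi G / c^4) \langle \hat{T}_{\mu\nu} \rangle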

This approach works very well in practice except (a) in regions that our technology has no hope of probing any time soon: the deep interiors of black holes, and the very hottest, densest phase of the universe in our past, and (b) where there is (under our choices above) a superposition of positions of the quantum field values, where the spatial separation is significant and the superposition lasts long enough to "measure" gravitationally (e.g. with a Cavendish apparatus). We will probably probe (b) directly within the next few years.

Unfortunately, the Minkowski metric we were using above is an extremely special case of no curvature. No curvature and no cosmological constant means, strictly, no matter. Fully classically, G = T, we then have T = 0: our curvature-free spacetime is vacuum. However, semiclassically, G = <T> only guarantees a vacuum on average; there could be some configuration of quantum field values that is non-vacuum, and in principle it could condense, exposing us to a problem similar to (b) in the previous paragraph.

We can also attack this class of problem with perturbation theory, where our metric is a background and we perturb around it. The Quantum Field Theories we're interested in are linear, so we can just sum perturbations of a quantum field; we can do the same in weak gravity -- this is how linearized gravity [3] works -- but because the Einstein Field Equations are nonlinear, we stop being able to simply add a perturbation of the metric to the background metric. So we run into the (a) problem, where the perturbations of the quantum field are so hard to exactly match with corresponding perturbations of the background metric that nobody knows how to do it yet.

But if we look again at the "exploded" Einstein Field Equations [2], then we can say g = \eta + h + O(h^2) + ..., where \eta is the background (in this case, flat spacetime) metric, h is a field perturbation on it, and O(h^n) are higher-order perturbations. As long as we can essentially ignore O(h^n), n > 1, semiclassical gravity linearizes beautifully and accurately, with supporting evidence from solar-system scales to binary pulsars.
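
For what it's worth, in the Lorenz gauge the linearized equations take a simple wave-equation form for the trace-reversed perturbation (up to sign conventions tied to the choice of metric signature):

    \Box \bar{h}_{\mu\nu} = -(16 \pi G / c^4) T_{\mu\nu}, \qquad \bar{h}_{\mu\nu} = h_{\mu\nu} - \tfrac{1}{2} \eta_{\mu\nu} h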

Finally, when we use a different metric or a different set of coordinates, or both, the interpretations of the components of the stress-energy tensor at each point must be varied accordingly. However, in all cases the tensor value T itself determines the Einstein tensor, in all systems of coordinates and with all metrics.

More details, if you like, at https://en.wikipedia.org/wiki/Semiclassical_gravity which has a good References section (Birrell and Davies is the former gold standard, but Leonard Parker & David Toms, _Quantum Field Theory in Curved Spacetime_, is a bit more up to date).

--

[1] https://en.wikipedia.org/wiki/Einstein_field_equations

[2] metric exploded out form (without cosmological constant): https://wikimedia.org/api/rest_v1/media/math/render/svg/7da0... versus the Einstein Tensor https://wikimedia.org/api/rest_v1/media/math/render/svg/2174...

[3] https://en.wikipedia.org/wiki/Linearized_gravity

[4] With a different metric, the momentum-energy flux through a point in spacetime can be gravitational too, and that is a critical source of nonlinearity; if we slightly modify our choices above to allow for it, a gravitational wave moving in the x direction will influence the T_i1 and T_1j components of the stress-energy tensor, for instance. Returning to the quantum question, that means that a field value at that point might be excited by the passing wave; with a strong enough excitation, perhaps a particle might appear there that would not have if the gravitational wave had not been coincident at that point.


My analysis shows an AWS ELB changes IP addresses on us roughly every 2 weeks. Often enough to cause problems if you aren't prepared, but infrequent enough to give you a false confidence that things are working as designed.
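
If it bites you, the usual defence is to stop caching a resolved IP and instead re-resolve the ELB hostname regularly (or per connection). A minimal Python sketch, with a hypothetical hostname:

    import socket

    ELB_HOSTNAME = "my-load-balancer.example.elb.amazonaws.com"  # hypothetical

    def current_elb_ips(hostname=ELB_HOSTNAME, port=443):
        # Re-resolve on every call instead of caching a single address forever;
        # the IPs behind an ELB hostname can and do change.
        return sorted({info[4][0] for info in socket.getaddrinfo(hostname, port)})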

"Of course foreigners steal your job, but maybe, if someone without contacts, money, or speaking the language steals your job, you're shit." ~ Louis C. K.

As one of those people stealing your jobs, I agree. And I don't even have to be on-site to do it. Because the internet is awesome.

