Math - - (+0|-0)

My Arguments for sideways/transposed matrix multiplication having near universal adoption(self)

The sub I put this in is Math but it could equally have been put under machine learning or programming.
The first thing to note is that almost all hardware is doing it already. The idea of having a different way of thinking about matrix multiplication is not that strange.
But before I get too into it I should explain what sideways matrix multiplication is. Trust me. It isn't that hard to understand, which is kind of why I want it to be the default.
So with matrix multiplication we are taking two grids of numbers doing some operation and getting another grid out. My argument is that "sideways" multiplication is the more obvious and sensible way to do it.
For two matrices to be multiplied (in this form) they need to have equal widths. To perform the operation you just take the dot product of every pairing of a row in each matrix. You start with the first row of the left matrix and the first row of the right matrix, multiply element-wise and add up. A dot product. That is the first element in the result. Then you just iterate over all the rows of the right matrix to fill in the first row of the result matrix. Then you continue onto the second row of the left matrix and you are now filling in the second row of the result matrix.
Why is this better? Well there are about 10 reasons.
One is it is easier to remember the rules for if two matrices can be multiplied. You don't need to remember that the width of the left one needs to match the height of the right one, or is i the height of left one and the width of the right one? No. When we want to see that an operation is valid we want to note that two things share some like property. Being able to add two scalars. They have the like property that they are scalars. Vectors, same thing, and the property of having the same length. There is a certain type agreement relationship we want things to have to define that an operation can be done on a pair.
The second is that for both mathematical and programming reasons it is nice if a more complex operation can be described as a composite of operations that are one level less in terms of complexity. Even if that's not how you end up implementing it in practice for performance reasons, it is keeping with functional design that you should be able to implement it that way.
Speaking of performance, it is more performant. Both for naive code and for highly optimized code. Because you are repeatedly performing a dot product on different segments of contiguous memory memory access is better and it is easier to leverage SIMD/AVX. It also removes a stride variable from the code. This also makes the most optimized code, which is often hard enough to read, involve one less element of complexity. This is why most math libraries are doing this kind of matrix multiplication under the hood depending on a size check. In pytorch and many others one of your matrices will be transposed to another allocation of memory before going out to the more efficient algorithm. And there is a waste that occurs with every operation just to give mathematicians a more familiar operation that they were taught in school.
There are a lot of other nice properties. One is that when matrices are described by (width,height), another standard mathematicians need to get on board with, the right matrix becomes a verb that can be read (from,to). So a (5x3) x (5x10) -> (10,3). The five mapped to 10. We mapped a 5 field matrix to a 10 field matrix, and maintained 3 entries.
The next cool property is semi-commutability. Maybe this one isn't that cool because it won't give you the same result, but you can swap the order and it's still a valid operation. I think that's cool. 5 field matrices can always be multiplied.
Another reason is some software already uses this method by default. ggml which is the library behind LlamaCPP (written by the same person) uses this method exclusively. Therefore when weights are translated from one file format to .ggml the weights are transposed. So now we have less format consistency because of the mathematicians hanging onto their inelegant way of describing multiplication.
This also means that there is a lot of code duplication and longer function names for both implementations. This also means that the lowest level implementations are going to be less accessible to higher level users because the format expected will not match how they were educated and f32matrixmultiplytransposed has one more name making it less reasonable to utilize in high level code. And that name is going to have to remain at any level of code because it will break what the mathematicians expect.
In short it is easier for humans to read. It is more obvious to derive. It is easier for computers to read. Code for it performs better. Code for it follows better design principles. Switching to it universally shortens function names. It reduces bloated libraries with a million variants of the same function by half.
Mathematicians are the aces of abstract reasoning. I think they can deal with the shift. While the engineers and programmers can have about 5 different headaches caused by the current standard permanently checked off of the list among the many other concerns we have to deal with.
I'm mostly thinking about this from a machine learning perspective. I don't know what the impact would be on computer graphics. Probably speed them up because you would be getting rid of a hidden transpose operation.

Comment preview

[-]High_Quality_Dick_Pics

1(+1|0)

This is good. I like this. I will have to re-read and do a deep think on this, but you are correct in this observation. So you suggest that instead of the traditional matrix multiplication rule - number of columns in the first matrix must equal the number of rows in the second - we should use a multiplication where the rows of both matrices are paired for dot products? So its row to row not column to row?

parent linkreply

[-]x0x7

0(+0|0)

Yes. It already exists in code all over the place. Humans just don't think in those terms, but I think it would be a benefit if they did.

parent linkreply

[-]iSnark

0(+0|0)

All of this but we can't put timestamps on chat! ;-)

parent linkreply

Score		1
Age		1
Proximity		1
Bump		1