Diagonally Traversing an Array

Consider the array

[[ a00  a01  a02  a03  a04  a05 ],
 [ a10  a11  a12  a13  a14  a15 ],
 [ a20  a21  a22  a23  a24  a25 ],
 [ a30  a31  a32  a33  a34  a35 ],
 [ a40  a41  a42  a43  a44  a45 ],
 [ a50  a51  a52  a53  a54  a55 ]

let’s say we want to traverse is diagonally as such

diagonal -5:  a50
diagonal -4:  a40 -> a51
diagonal -3:  a30 -> a41 -> a52
diagonal -2:  a20 -> a31 -> a42 -> a53
diagonal -1:  a10 -> a21 -> a32 -> a43 -> a54
diagonal -0:  a00 -> a11 -> a22 -> a33 -> a44 -> a55
diagonal  1:  a01 -> a12 -> a23 -> a34 -> a45
diagonal  2:  a02 -> a13 -> a24 -> a35
diagonal  3:  a03 -> a14 -> a25
diagonal  4:  a04 -> a15
diagonal  5:  a05

I’ve had to think about this question myself sometime back and one of the things that I realize is this traversals of any kind in a 2D grid structure like this essentially boils down to the line or curve equations for the points

[Tl:DR convert diagonal to straight line by shearing + simplify diagonal traversal]

Math

for clarity I will leave out the a and just use numbers. It may help to visualize if you turn the screen so y is facing up (xcoord,ycoord)

--------------------------------------> (+y axis)
|(0,0) (0,1) (0,2) (0,3) (0,4) (0,5)
|(1,0) (1,1) (1,2) (1,3) (1,4) (1,5)(1,6)[assume]
|(2,0) (2,1) (2,2) (2,3) (2,4) (2,5)
|(3,0) (3,1) (3,2) (3,3) (3,4) (3,5)
|(4,0) (4,1) (4,2) (4,3) (4,4) (4,5)
|(5,0) (5,1) (5,2) (5,3) (5,4) (5,5)
|      (6,1)[assume] 
v
(+x axis)

you want a diagonal traversal (5,0)->(4,0)->(5,1)->(3,0)->(4,1)->(5,2) … so on this is bascially drawing 45 degree straight lines

m is the slope c is y-intercept - easy way to understand is that setting x=0 gives y=c ie c is independent of x x,y are the respective coordinates

y = mx + c [eq1]

take any two points on the same diagonal for example

[x1,y1](3,0) and [x2,y2](5,2)

m = (y2-y1)/(x2-x1) = (2-0)/(5-3) = 2/(2) = 1

lets substitute of of the points say [x2,y2](5,2)

2 = 1*5 + c
c = 2-3 = -1

c = -1 [eq2]
y = x - 1 [eq3]

let’s do this for a bunch of other diagonals specially the ones towards the start and end - we don’t need to calculate m again as all lines have the same slope of m=1 we’re just interested in seeing how c changes as we traverse through the matrix

for diagonal [x1,y1](5,0) [x2,y2](6,1)
y = mx + c
1 = 1*(6) + c
c = 1-6 = -5 [eq4]

for diagonal [x1,y1](4,0) [x2,y2](5,1)
y = mx + c
1 = 1*(5) + c
c = 1-5 = -4 [eq5]

... so on untill

for diagonal [x1,y1](0,4) [x2,y2](1,5)
y = mx + c
5 = 1*(1) + c
c = 5-1 = 4 [eq6]

for diagonal [x1,y1](0,5) [x2,y2](1,6)
y = mx + c
6 = 1*(1) + c
c = 6-1 = 5 [eq7]


we see a pattern here 

c [-5,-4...,4,5] 

[NOTE]
c = -5  depends on the number of rows
c = 5   depends on the number of cols   

Code

Now that we understand what’s occuring let’s buld the code i’m going to use C

// let n_rows : number of rows (6) here
// let n_cols : number of cols (6) here

#include "stdio.h"

int main(int argc, char** argv){

    int a[6][6] = 
        {
            {0,     1,      2,      3,      4,      5},
            {6,     7,      8,      9,      10,     11},
            {12,    13,     14,     15,     16,     17},
            {18,    19,     20,     21,     22,     23},
            {24,    25,     26,     27,     28,     29},
            {30,    31,     32,     33,     34,     35},
        };

    int n_rows = 6;
    int n_cols = 6;
    
    int x=0,y=0,c=0;

    for(c = -(n_rows-1); c <= (n_cols-1); ++c){
    // now we get our varying c which allows us to move the line

    // the xvariable for these diagonals goes down from (n_rows-1) to 0
        for(y = 0; y <= n_cols-1 ; ++y){

            x = y - c;
            
            // do bounds checking 
            if(x>= 0 && x<= (n_rows-1)){
                printf("%d ",a[x][y]);
            }

        }
    }
    return 0;
}

Some brainstorming

so this really simplifed the diagonal traversal our next step is to think what if the matrix is not 6x6

what is if is 10000x1000 and your machine can only fit 10x10 into memory

let’s see how we can optmize this further : basic idea is to start streaming the data in chunk that you need and then discard them consider the following grid 6x5 i’m picking arbitrarily now each of the boxes correspond to a 3x3 region in memory

so that is total of 635*3 = 270 elements in total lets’s say my budget allows me to only store 9 elements at a time and space for a small number of constant variables to keep track of state let’s number the grid

[full data grid that needs to be made into chunks]

---.---.---.---.---.
|15|20 |24 |27 |29 |
---.---.---.---.---.
|10|16 |21 |25 |28 |
---.---.---.---.---.
|6 |11 |17 |22 |26 |
---.---.---.---.---.
|3 | 7 |12 |18 |23 |
---.---.---.---.---.
|1 | 4 | 8 |13 |19 |
---.---.---.---.---.
|0 |2  | 5 | 9 |14 |
---.---.---.---.---.

so if you want to diagonally traverse the 270 elements - we would need to diagonall traverse all the 30 chunks as well

[a single chunk]
.---.---.---.
| g | h | i |
.---.---.---.
| d | e | f |
.---.---.---.
| a | b | c |
.---.---.---.

so the traversal fashion will be hierarchical

but here’s the deal - if memory isn’t a contraint or you do have a setup to steam data very fast

here’s a very neat way to do diagonal traversal and maximize cache hits

this would take O((n_rows + n_cols) * n_rows) space

        .---.---.---.
  *   * | g | h | i |
    .---.---.---.---.
  * | d | e | f | $
.---.---.---.---.
| a | b | c | $   $
.---.---.---.

* are empty spots with some illegal value
$ are possible values from future chunks


let this be your new matrix - just traverse each column top to bottom - it is better to vertically flip the matrix as well to maximize leagal hits

.---.---.---.
| a | b | c | $   $
.---.---.---.---.
  * | d | e | f | $
    .---.---.---.---.
  *   * | g | h | i |
        .---.---.---.


this above 'shear' technique will increase cache hits 
at the expense of requiring some data preprocessing and streaming



[full data grid that needs to be made into chunks]

---.---.---.---.---.
|15|20 |24 |27 |29 |
---.---.---.---.---.
|10|16 |21 |25 |28 |
---.---.---.---.---.
|6 |11 |17 |22 |26 |
---.---.---.---.---.
|3 | 7 |12 |18 |23 |
---.---.---.---.---.
|1 | 4 | 8 |13 |19 |
---.---.---.---.---.
|0 |2  | 5 | 9 |14 |
---.---.---.---.---.


coming back to the grid 
diagonal_ids[] = [diag for d in range(-(n_rows-1),(n_cols-1))]

each one of the chunks can be separately processed based on the diagonal ID's

Some sketch of a pseudo code

1. use the shearing mechanism to increase cache hits
    you will have stream data and concatenate sheared blocks loading memory in bigger chunks into ram is fast than smaller chunks
    caches draw from the memory so it'll work fine as long as we can stream sheared chunks

2. calculate diagonal numbers
    y = chunk_y + offset
    x = chunk_x + offset
    diag_id = (y-x) + (n_rows-1)

3. create a threadpool based on the number of available threads

4. launch separate threads and have them process the columns 

Summary

calculate diagonal indices in a simple manner
shear data
shearing converts problem to sequential access
sequential access == cache is happy