Efficient Way To Iterate Through Coo_matrix Elements Ordered By Column?

I have a scipy.sparse.coo_matrix matrix which I want to convert to bitsets per column for further calculation. (for the purpose of the example, I'm testing on 100Kx1M). I'm current

Solution 1:

Make a small sparse matrix:

In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [84]: print(M)
  (1, 3)    0.03079661961875302
  (0, 2)    0.722023291734881
  (0, 3)    0.547594065264775
  (1, 0)    1.1021150713641839
  (1, 2)    0.585848976928308

That print, like nonzero, shows the row and col arrays in the order they are stored:

In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))

Conversion to csr orders the entries by row (but not necessarily by column). nonzero converts back to coo and returns row and col in the new order:

In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))

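I was going to say conversion to csc orders the columns, but nonzero doesn't show it; the result comes back in the same row order as for csr (as far as I can tell, nonzero on a csc sorts its result by row before returning it):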

In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))

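The transpose of a csr is a csc. Here nonzero runs on the transposed matrix, so the first array holds the original column indices, now sorted, and the second holds the original row indices: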

In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
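
If the goal is only to visit the coo entries in column order, one option that avoids a format conversion is to sort the row, col and data arrays together. The following is a minimal sketch, not part of the original session: it assumes a matrix with the same nonzero pattern as M above, and the values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Same nonzero pattern as M in the session above; values are illustrative.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

# np.lexsort uses the LAST key as the primary key:
# sort by column first, then by row within each column.
order = np.lexsort((M.row, M.col))
rows_sorted = M.row[order]
cols_sorted = M.col[order]
data_sorted = M.data[order]

# Plain Python loop over the entries, now in column order.
for r, c, v in zip(rows_sorted, cols_sorted, data_sorted):
    print(r, c, v)
```

The sorted arrays agree with the M.tocsr().T.nonzero() result above: columns come out as 0, 2, 2, 3, 3 and rows as 1, 0, 1, 0, 1. The loop itself is still slow Python iteration, so for a large matrix it is usually better to work on the sorted arrays as a whole.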

I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:

In [90]: M.tolil().rows
Out[90]: 
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
      dtype=object)
In [91]: M.tolil().T.rows
Out[91]: 
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
      dtype=object)
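
The same per-column row lists can also be read from the csc arrays, which may be cheaper than lil for a large matrix because no Python lists are built until a column is requested. This is a sketch under assumptions: the helper name column_rows is hypothetical, and the matrix reuses the nonzero pattern of M with all values set to 1.

```python
import numpy as np
from scipy import sparse

# Same nonzero pattern as M above; values set to 1 for illustration.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
M = sparse.coo_matrix((np.ones(5), (row, col)), shape=(5, 5))

C = M.tocsc()
C.sort_indices()  # ensure row indices within each column are sorted


def column_rows(C, j):
    """Row indices of the stored entries in column j (hypothetical helper)."""
    return C.indices[C.indptr[j]:C.indptr[j + 1]]


per_column = [column_rows(C, j).tolist() for j in range(C.shape[1])]
print(per_column)
```

For this pattern the result matches M.tolil().T.rows: column 0 holds row 1, columns 2 and 3 hold rows 0 and 1, and columns 1 and 4 are empty. Each slice is a view into C.indices, so it could be turned into a set or bitset per column as needed.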

In general, iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations make use of it indirectly (e.g. row sum). Another relatively fast set of operations is those that work directly on the data attribute, without paying attention to row or column values.
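
To illustrate those two fast paths, here is a small sketch with made-up values on the same nonzero pattern as M. It computes column sums as a matrix product with a vector of ones, and a sum of squares straight from the data attribute without looking at row or column indices.

```python
import numpy as np
from scipy import sparse

# Same nonzero pattern as M above; values are illustrative.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5)).tocsr()

# Column sums expressed as a matrix product: M.T @ ones.
col_sums = M.T @ np.ones(M.shape[0])

# Work directly on the stored values; row/col positions are ignored.
sq_total = (M.data ** 2).sum()

print(col_sums, sq_total)
```

Neither computation loops over entries in Python, which is why this style tends to scale much better than element-wise iteration.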

coo doesn't implement indexing or iteration. csr and lil implement those.
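
A short sketch of what that means in practice, again with made-up values on the pattern of M: convert to csr first, then index single elements or whole rows. The variable names here are illustrative only.

```python
import numpy as np
from scipy import sparse

# Same nonzero pattern as M above; values are illustrative.
row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

R = M.tocsr()

# Element lookup on the csr copy.
value = R[1, 3]

# Row-by-row access: R[i] is a 1xN sparse row whose .indices are
# the column indices stored in that row.
row_cols = [sorted(R[i].indices.tolist()) for i in range(R.shape[0])]

print(value, row_cols)
```

Row access on csr is cheap compared with column access, so for column-wise work the csc format (or the transpose) is the closer fit.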
