Monday, 31 December 2018

How to load a sparse matrix efficiently?

Given a file with this structure:

  • Lines with a single column are row keys
  • The lines that follow each key hold a column key and its non-zero value

For example:

abc
ef 0.85
kl 0.21
xyz 0.923
cldex 
plax 0.123
lion -0.831

How can I create a SciPy sparse matrix (e.g. a csr_matrix) with these entries?

('abc', 'ef') 0.85
('abc', 'kl') 0.21
('abc', 'xyz') 0.923
('cldex', 'plax') 0.123
('cldex', 'lion') -0.831

I've tried:

from collections import defaultdict

x = """abc
ef  0.85
kl  0.21
xyz 0.923
cldex 
plax    0.123
lion    -0.831""".split('\n')

k1 = ''  # most recent single-column key
arr = defaultdict(dict)
for line in x:
    # split on any whitespace, so both tab- and space-separated lines parse
    line = line.split()
    if len(line) == 1:
        k1 = line[0]
    else:
        k2, v = line
        arr[k1][k2] = float(v)

[out]

>>> arr
defaultdict(dict,
            {'abc': {'ef': 0.85, 'kl': 0.21, 'xyz': 0.923},
             'cldex': {'plax': 0.123, 'lion': -0.831}})

A nested dict isn't as convenient to work with as a SciPy sparse matrix.

Is there an easy way to read a file in the format above into one of SciPy's sparse matrix objects?
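
One possible direction, as a minimal sketch rather than a definitive answer: collect (row, column, value) triplets while reading the lines, then hand them to scipy.sparse.coo_matrix and convert to CSR. The helper names parse_triplets and to_csr are hypothetical, and the sketch assumes that row keys and column keys are indexed separately in order of first appearance and that each (row, column) pair occurs at most once.

```python
def parse_triplets(lines):
    """Parse key lines and 'column value' lines into COO-style triplets."""
    row_ids, col_ids = {}, {}  # label -> integer index, in order of first appearance
    rows, cols, vals = [], [], []
    current = None
    for raw in lines:
        parts = raw.split()  # split on any whitespace
        if not parts:
            continue  # skip blank lines
        if len(parts) == 1:
            # single-column line: switch to this row key
            current = row_ids.setdefault(parts[0], len(row_ids))
        else:
            if current is None:
                raise ValueError("value line found before any key line")
            label, value = parts
            rows.append(current)
            cols.append(col_ids.setdefault(label, len(col_ids)))
            vals.append(float(value))
    return rows, cols, vals, list(row_ids), list(col_ids)


def to_csr(rows, cols, vals, n_rows, n_cols):
    """Build a CSR matrix from the triplets (requires SciPy)."""
    # imported here so that the parsing step has no SciPy dependency
    from scipy.sparse import coo_matrix
    return coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols)).tocsr()


sample = """abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831""".splitlines()

rows, cols, vals, row_labels, col_labels = parse_triplets(sample)
```

Calling to_csr(rows, cols, vals, len(row_labels), len(col_labels)) should then give a 2 x 5 matrix whose entry (i, j) belongs to (row_labels[i], col_labels[j]). For a real file, passing the open file object in place of sample lets the lines stream through without first building the nested dict.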


