Split Csv Column In Subcolumns Using Numpy Dtype And Converters

I have a CSV file with some columns containing a measured value together with its error. I want to import everything into Python using numpy genfromtxt and format my array using dtype.

Solution 1:

With your dtype, and 4 columns, it works (nested dtype and all):

In [58]: TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
    ...: TypeCSV=np.dtype({"names": ["name", "time", "intensity"],
    ...:                   "formats": ["U32", np.int32, TypeValErr],
    ...:                   "titles": ["Name", "Time", "Intensity"]})
    ...: 
In [59]: txt=b"""# Name, Time, Intensity
    ...: Sample1, 300, 1000, 5
    ...: Sample2, 300, 1500, 2"""
In [60]: 
In [60]: data=np.genfromtxt(txt.splitlines(), dtype=TypeCSV, delimiter=',',skip_header=True)
In [61]: data
Out[61]: 
array([('Sample1', 300, (1000, 5)), ('Sample2', 300, (1500, 2))], 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

So it is able to take a flat list of values, e.g. ['Sample1', 300, 1000, 5], and map them onto the nested tuples needed to store this dtype: ('Sample1', 300, (1000, 5)).

But a converter does not turn ['Sample1', '300', '1000+-5'] into ['Sample1', '300', (1000, 5)], or if it does, the result isn't usable in the subsequent mapping step.
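Writing such a converter on its own is easy; the problem is where genfromtxt applies it. A minimal sketch (the name split_val_err is my own, not part of numpy):

```python
def split_val_err(field):
    # Hypothetical converter: parse a b"value+-error" field
    # into a (value, error) tuple of ints.
    value, error = field.split(b"+-")
    return (int(value), int(error))

print(split_val_err(b"1000+-5"))  # (1000, 5)
```

Even though this returns the right tuple for one field, the machinery below flattens the dtype to four scalar fields before mapping, so a three-element converted row cannot be viewed onto the nested dtype.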

dtype_flat in the error message is:

In [70]: np.lib.npyio.flatten_dtype(TypeCSV)
Out[70]: [dtype('<U32'), dtype('int32'), dtype('int32'), dtype('int32')]

So your nested dtype is produced with a sequence like this:

In [75]: rows=np.array(('str',1,2, 3),dtype=[('',_) for _ in np.lib.npyio.flatten_dtype(TypeCSV)])
In [76]: rows.view(TypeCSV)
Out[76]: 
array(('str', 1, (2, 3)), 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

In fact, there's a comment to that effect just before the line that raises the error:

    if len(dtype_flat) > 1:
        # Nested dtype, eg [('a', int), ('b', [('b0', int), ('b1', 'f4')])]
        # First, create the array using a flattened dtype:
        # [('a', int), ('b1', int), ('b2', float)]
        # Then, view the array using the specified dtype.
        if 'O' in (_.char for _ in dtype_flat):
        ...
        else:
            rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
            output = rows.view(dtype)

data at this point is a list of `row` tuples, which have already been passed through the converters:

rows = list(
        zip(*[[conv._strict_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))

Simplified, the conversion process is:

In [84]: converters = [str, int, int, int]
In [85]: row = ['one', '1', '2', '3']
In [86]: [conv(r) for conv, r in zip(converters, row)]
Out[86]: ['one', 1, 2, 3]

but actually closer to:

In [87]: rows = [row, row]
In [88]: rows
Out[88]: [['one', '1', '2', '3'], ['one', '1', '2', '3']]
In [89]: from operator import itemgetter
In [90]: [[conv(r) for r in map(itemgetter(i), rows)] for (i, conv) in enumerate(converters)]
Out[90]: [['one', 'one'], [1, 1], [2, 2], [3, 3]]
In [91]: list(zip(*_))
Out[91]: [('one', 1, 2, 3), ('one', 1, 2, 3)]

So the long and short of it is that converters cannot split a column into two or more columns: splitting, converting, and mapping onto the dtype happen in the wrong order for that. What I demonstrated at the start is probably easiest: pass your file, line by line, through a text-processing step that replaces the +- with the specified delimiter. The file then has the correct number of columns to work with your dtype.
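Putting that together: a sketch of the line-preprocessing approach, reusing the dtype from above on the original "1000+-5" style data (the sample lines here are my own illustration):

```python
import numpy as np

# Nested dtype from the question: "intensity" holds a (value, error) pair.
TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
TypeCSV = np.dtype({"names": ["name", "time", "intensity"],
                    "formats": ["U32", np.int32, TypeValErr],
                    "titles": ["Name", "Time", "Intensity"]})

raw = b"""# Name, Time, Intensity
Sample1, 300, 1000+-5
Sample2, 300, 1500+-2"""

# Replace "+-" with the delimiter so each line has one flat column
# per flattened dtype field (4 columns), before genfromtxt sees it.
lines = [line.replace(b"+-", b",") for line in raw.splitlines()]

data = np.genfromtxt(lines, dtype=TypeCSV, delimiter=",", skip_header=1)
print(data["intensity"]["error"])  # [5 2]
```

For a real file, the same replacement can be done in a generator over the open file object, since genfromtxt accepts any iterable of lines.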
