Split Csv Column In Subcolumns Using Numpy Dtype And Converters

I have a CSV file with some columns containing a measured value together with its error. I want to import everything into Python using numpy genfromtxt and format my array using dtype.

Solution 1:

With your dtype, and 4 columns, it works (nested dtype and all):

In [58]: TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
    ...: TypeCSV=np.dtype({"names": ["name", "time", "intensity"],
    ...:                   "formats": ["U32", np.int32, TypeValErr],
    ...:                   "titles": ["Name", "Time", "Intensity"]})
    ...: 
In [59]: txt=b"""# Name, Time, Intensity
    ...: Sample1, 300, 1000, 5
    ...: Sample2, 300, 1500, 2"""
In [60]: 
In [60]: data=np.genfromtxt(txt.splitlines(), dtype=TypeCSV, delimiter=',',skip_header=True)
In [61]: data
Out[61]: 
array([('Sample1', 300, (1000, 5)), ('Sample2', 300, (1500, 2))], 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

So it is able to take a flat list of values, e.g. ['Sample1', 300, 1000, 5], and map them onto the nested tuples needed to store this dtype: ('Sample1', 300, (1000, 5)).

But a converter does not turn ['Sample1', '300', '1000+-5'] into ['Sample1', '300', (1000, 5)], or if it does, the result isn't usable in the subsequent mapping step.
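Writing such a converter on its own is easy; the problem is where genfromtxt applies it. A minimal sketch (the name split_val_err is my own, not part of numpy):

```python
def split_val_err(field):
    # Hypothetical converter: parse a b"value+-error" field
    # into a (value, error) tuple of ints.
    value, error = field.split(b"+-")
    return (int(value), int(error))

print(split_val_err(b"1000+-5"))  # (1000, 5)
```

Even though this returns the right tuple for one field, the machinery below flattens the dtype to four scalar fields before mapping, so a three-element converted row cannot be viewed onto the nested dtype.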

dtype_flat in the error message is:

In [70]: np.lib.npyio.flatten_dtype(TypeCSV)
Out[70]: [dtype('<U32'), dtype('int32'), dtype('int32'), dtype('int32')]

So your nested dtype is produced with a sequence like this:

In [75]: rows=np.array(('str',1,2, 3),dtype=[('',_) for _ in np.lib.npyio.flatten_dtype(TypeCSV)])
In [76]: rows.view(TypeCSV)
Out[76]: 
array(('str', 1, (2, 3)), 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

In fact, there's a comment to that effect just before the line that raises the error:

    if len(dtype_flat) > 1:
        # Nested dtype, eg [('a', int), ('b', [('b0', int), ('b1', 'f4')])]
        # First, create the array using a flattened dtype:
        # [('a', int), ('b1', int), ('b2', float)]
        # Then, view the array using the specified dtype.
        if 'O' in (_.char for _ in dtype_flat):
        ...
        else:
            rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
            output = rows.view(dtype)

data at this point is a list of `row` tuples, which have already been passed through the converters:

rows = list(
        zip(*[[conv._strict_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))

Simplified, the conversion process is:

In [84]: converters = [str, int, int, int]
In [85]: row = ['one', '1', '2', '3']
In [86]: [conv(r) for conv, r in zip(converters, row)]
Out[86]: ['one', 1, 2, 3]

but actually closer to:

In [87]: rows = [row, row]
In [88]: rows
Out[88]: [['one', '1', '2', '3'], ['one', '1', '2', '3']]
In [89]: from operator import itemgetter
In [90]: [[conv(r) for r in map(itemgetter(i), rows)] for (i, conv) in enumerate(converters)]
Out[90]: [['one', 'one'], [1, 1], [2, 2], [3, 3]]
In [91]: list(zip(*_))
Out[91]: [('one', 1, 2, 3), ('one', 1, 2, 3)]

So the long and short of it is that converters cannot split a column into two or more columns: splitting, converting, and mapping onto the dtype happen in the wrong order for that. What I demonstrated at the start is probably easiest: pass your file, line by line, through a text-processing step that replaces the +- with the specified delimiter. The file then has the correct number of columns to work with your dtype.
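Putting that together: a sketch of the line-preprocessing approach, reusing the dtype from above on the original "1000+-5" style data (the sample lines here are my own illustration):

```python
import numpy as np

# Nested dtype from the question: "intensity" holds a (value, error) pair.
TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
TypeCSV = np.dtype({"names": ["name", "time", "intensity"],
                    "formats": ["U32", np.int32, TypeValErr],
                    "titles": ["Name", "Time", "Intensity"]})

raw = b"""# Name, Time, Intensity
Sample1, 300, 1000+-5
Sample2, 300, 1500+-2"""

# Replace "+-" with the delimiter so each line has one flat column
# per flattened dtype field (4 columns), before genfromtxt sees it.
lines = [line.replace(b"+-", b",") for line in raw.splitlines()]

data = np.genfromtxt(lines, dtype=TypeCSV, delimiter=",", skip_header=1)
print(data["intensity"]["error"])  # [5 2]
```

For a real file, the same replacement can be done in a generator over the open file object, since genfromtxt accepts any iterable of lines.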
