
Efficient Way To Create Numpy Arrays From Binary Files

I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure: File Header 149 Byte ASCII Header Record Start 4 Byte Int -

Solution 1:

Something like this should work (untested, but you get the idea):

import numpy as np

f = open(input_file, 'rb')
header = f.read(149)

# ... parse the header as you did; sample_rate and number_of_records
# come out of that parsing ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(f, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- np.fromfile then reads to the end of the file

time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()
x_series = data['samples'][:, :, 1].ravel()
y_series = data['samples'][:, :, 2].ravel()
z_series = data['samples'][:, :, 3].ravel()
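To sanity-check that a record dtype actually matches the file layout, you can round-trip a few synthetic records through raw bytes before pointing it at the real file. A minimal sketch (the values of sample_rate and number_of_records here are made up; in practice they come from your parsed header):

```python
import numpy as np

# Hypothetical values -- in practice these come from the parsed header.
sample_rate = 3
number_of_records = 2

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4)),
])

# Build two synthetic records and serialize them as the file would store them.
synthetic = np.zeros(number_of_records, dtype=record_dtype)
synthetic['timestamp'] = [100, 200]
synthetic['samples'] = np.arange(
    number_of_records * sample_rate * 4,
    dtype='<i2').reshape(number_of_records, sample_rate, 4)
raw = synthetic.tobytes()

# Reading the raw bytes back reproduces the original structure.
data = np.frombuffer(raw, dtype=record_dtype)
t_series = data['samples'][:, :, 0].ravel()
print(data['timestamp'])   # [100 200]
print(t_series)            # [ 0  4  8 12 16 20]
```

If the shapes or values come back wrong on real data, the dtype (field widths, byte order, or the per-record sample count) is the first thing to re-check.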

Solution 2:

One glaring inefficiency is the use of hstack in a loop:

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

On every iteration, this allocates a slightly bigger array for each series and copies all the data read so far into it. That means a lot of unnecessary copying and can lead to memory fragmentation.

I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc.

If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and would start bringing things back in one at a time, to see where exactly the time is spent.
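The fix described above can be sketched like this (record_chunks stands in for the per-record arrays the original loop produces):

```python
import numpy as np

# Stand-ins for the per-record chunks produced inside the reading loop.
record_chunks = [np.arange(3), np.arange(3, 6), np.arange(6, 9)]

# Instead of calling hstack inside the loop, collect chunks in a plain list...
t_parts = []
for record_t in record_chunks:
    t_parts.append(record_t)

# ...and concatenate once at the end: one allocation, one copy per element,
# instead of a reallocation-plus-full-copy on every iteration.
t_series = np.hstack(t_parts)
print(t_series)   # [0 1 2 3 4 5 6 7 8]
```

Appending to a Python list is amortized O(1), so the total cost becomes linear in the data size rather than quadratic.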


Solution 3:

Numpy supports mapping binary data from files directly into array-like objects via numpy.memmap. You might be able to memmap the file and extract the data you need via offsets.
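A minimal sketch of the memmap approach, assuming (as in the question) a 149-byte ASCII header followed by a fixed-layout payload; the file path and dtype here are illustrative stand-ins:

```python
import os
import tempfile
import numpy as np

# Build a small fake file: a 149-byte ASCII header followed by int32 payload.
header = b'H' * 149
payload = np.arange(10, dtype='<i4')
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as f:
    f.write(header)
    f.write(payload.tobytes())

# Map past the header with an explicit byte offset; pages are loaded lazily
# as they are touched, so nothing is read eagerly.
mm = np.memmap(path, dtype='<i4', mode='r', offset=149)
print(mm[:3])   # [0 1 2]
```

For structured records you can pass the same record dtype from Solution 1 as the memmap's dtype, as long as the offset lands on the first record.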

To handle endianness, call numpy.byteswap on what you have read in. You can use a conditional to check the endianness of the host system:

import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
    # Host is big-endian; convert in place
    arrayName.byteswap(True)

Solution 4:

I have got satisfactory results with a similar problem (multi-resolution multi-channel binary data files) by using array and struct.unpack. In my problem, I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented one.

The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an array.array):

import os
from array import array

f = open(somefilename, 'rb')
# ... read/parse the header here, advancing f past it ...
fullsamples = array('h')
# Remaining bytes divided by 2, since 'h' items are 2 bytes each
fullsamples.fromfile(f, (os.path.getsize(somefilename) - f.tell()) // 2)
position = 0
for rec in range(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position + samples])
        position += samples

Of course, I cannot state this is better or faster than other answers provided, but at least it is something you might evaluate.
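Stripped of the class attributes, the de-interleaving idea looks like this; the two-channel layout and sample counts below are invented purely to demonstrate:

```python
import os
import tempfile
from array import array

# Hypothetical layout: 2 records, each holding 3 samples for 'A' then 2 for 'B'.
nsamples = {'A': 3, 'B': 2}
channel_labels = ['A', 'B']
nrecs = 2

# Write a small interleaved int16 file to demonstrate.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(path, 'wb') as f:
    array('h', range(10)).tofile(f)   # 2 * (3 + 2) samples total

# Read the whole file once, then hand out known-sized slices per channel.
recording = {ch: array('h') for ch in channel_labels}
with open(path, 'rb') as f:
    fullsamples = array('h')
    fullsamples.fromfile(f, os.path.getsize(path) // 2)

position = 0
for rec in range(nrecs):
    for ch in channel_labels:
        n = nsamples[ch]
        recording[ch].extend(fullsamples[position:position + n])
        position += n

print(list(recording['A']))   # [0, 1, 2, 5, 6, 7]
print(list(recording['B']))   # [3, 4, 8, 9]
```

The single fromfile call replaces many small reads, which is where most of the speedup comes from.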

Hope it helps!

