
Numpy loading csv TOO slow compared to Matlab

I posted this question because I was wondering whether I did something terribly wrong to get this result.

I have a medium-size csv file and I tried to use numpy to load it. For illustration, I made the file using python:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')

Then I tried two functions, numpy.genfromtxt and numpy.loadtxt:

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)

And the result shows that t1 = 32.159652940464184, t2 = 52.00093725634724.
However, when I tried the same in MATLAB:

tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc
The result shows: Elapsed time is 3.196465 seconds.

I understand that there may be some differences in the loading speed, but:

  1. The gap is much larger than I expected;
  2. Shouldn't np.loadtxt be faster than np.genfromtxt?
  3. I haven't tried the Python csv module yet, because loading csv files is something I do very frequently and the csv module makes the code a bit verbose... but I'd be happy to try it if that's the only way. For now I'm mainly concerned about whether I'm doing something wrong.

Any input would be appreciated. Thanks a lot in advance!

Yeah, reading csv files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm using pure numpy I still use pandas for IO:

>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms

Alternatively, in a simple enough case like this one, you could use something like what Joe Kington wrote here:

>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
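For readers who can't follow the link, the iter_loadtxt approach can be sketched roughly like this (a reconstruction of the idea, not Joe Kington's exact code): stream the file through a generator and let numpy.fromiter build a flat array, then reshape it using the observed row length.

```python
import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    """Stream a delimited text file into a 2-D array via numpy.fromiter."""
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # remember how many columns the last row had
        iter_loadtxt.rowlength = len(line)

    # fromiter avoids building intermediate Python lists
    data = np.fromiter(iter_func(), dtype=dtype)
    return data.reshape((-1, iter_loadtxt.rowlength))
```

This keeps the per-item work in a tight generator and skips most of the flexibility (comments, missing values) that makes genfromtxt slow.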

There's also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s
    • Thank you very much! pd.read_csv works great for me - in fact it finished in only half the time that MATLAB took! Thanks also for the other two lighter-weight methods, which were very informative.
    • Speed is not the only thing to care about. In my case, both np.genfromtxt and pd.read_csv need more RAM than I have in order to read a 1,209,836,036-byte text file: the former does not care and hangs the system, while the latter at least throws an error. np.fromfile is almost 4 times quicker than np.loadtxt, and neither of those two takes much memory to run.
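A sketch of the flat-text route the comment above alludes to (filename is illustrative): np.fromfile with a text separator parses a flat stream of numbers. With sep=' ' every run of whitespace, including the newlines between rows, acts as a separator, so it can read a space-delimited file, but it returns a 1-D array that must be reshaped and it is not a general CSV reader.

```python
import numpy as np

a = np.random.rand(1000, 3)
np.savetxt('test_ws.txt', a, fmt='%.6f')  # space-delimited text file

# sep=' ' matches any run of whitespace, so row breaks are
# consumed too; the result is one flat 1-D array of floats.
flat = np.fromfile('test_ws.txt', sep=' ')
data = flat.reshape(-1, 3)                # restore the row structure
```

Because fromfile does almost no per-line bookkeeping, it is much faster than loadtxt, at the cost of losing the row structure (and any real CSV features).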

If you just want to save and read a numpy array, it's much better to save it as binary or compressed binary, depending on size:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
np.save('./testy', my_data)
np.savez('./testz', my_data)
del my_data

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.load('./testy.npy')
"""
stmt3 = """\
my_data = np.load('./testz.npz')['arr_0']
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
t3 = timeit.timeit(stmt=stmt3, setup=setup_stmt, number=3)

genfromtxt 39.717250824
save 0.0667860507965
savez 0.268463134766
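As a further note on the .npy route (an addition here, not part of the answer above): np.load can also memory-map the file, so a large array "opens" almost instantly and only the pages you actually touch are read from disk.

```python
import numpy as np

my_data = np.random.rand(1000, 3)
np.save('testy.npy', my_data)

# mmap_mode='r' maps the file instead of reading it eagerly;
# slicing touches the disk only for the pages it needs.
arr = np.load('testy.npy', mmap_mode='r')
first_rows = np.array(arr[:10])   # materialise just a small slice
```

This is handy when the array is bigger than RAM, which the comment about the 1.2 GB file runs into.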
    • Thank you Ophion! This is a great answer, and really useful - I had been using cPickle, but now I realize that np.savez is faster and more compact than cPickle, as long as only ndarrays are involved. I did not mark this as accepted because in this question I was trying to read experiment data saved by LabVIEW. But still, thank you so much!

Perhaps it's better to rig up a small C program which converts the data to binary and have numpy read the binary file. I have a 20 GB CSV file to read, with the CSV data being a mixture of int, double, and str. Reading it into a numpy array of structs takes more than an hour, while dumping to binary took about 2 minutes and loading it into numpy takes less than 2 seconds!

My specific code, for example, is available here.
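The binary round-trip this answer describes can be sketched in pure Python with a structured dtype (the field names and layout here are hypothetical; the real layout depends on the CSV):

```python
import numpy as np

# Hypothetical record layout: an int id, a double value, a 16-byte string.
rec = np.dtype([('id', '<i4'), ('value', '<f8'), ('label', 'S16')])

rows = np.array([(1, 3.14, b'alpha'), (2, 2.72, b'beta')], dtype=rec)
rows.tofile('records.bin')            # raw fixed-width binary dump

# Loading the binary file back needs no text parsing at all,
# which is why it is orders of magnitude faster than reading CSV.
back = np.fromfile('records.bin', dtype=rec)
```

The one-time conversion (in C or Python) pays for itself as soon as the file is read more than once.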


FWIW the built-in csv module works great and really is not that verbose.

csv module:

import csv
import numpy as np

with open('test.csv', 'r') as f:
    # note: pass dtype=float, otherwise this builds an array of strings
    data = np.array([l for l in csv.reader(f)], dtype=float)

1 loop, best of 3: 1.62 s per loop


%timeit np.loadtxt('test.csv', delimiter=',')

1 loop, best of 3: 16.6 s per loop


%timeit pd.read_csv('test.csv', header=None).values

1 loop, best of 3: 663 ms per loop

Personally I like using pandas read_csv but the csv module is nice when I'm using pure numpy.

    • I know this is an old question, but if you are using pure numpy you can still use pandas for IO and then use `pd.DataFrame.values` to extract the numpy array.

I've performance-tested the suggested solutions with perfplot (a small project of mine) and found that

pandas.read_csv(filename)

is indeed the fastest solution (once more than about 2000 entries are read; below that, everything is in the range of milliseconds). It outperforms numpy's variants by a factor of about 10. (numpy.fromfile is here just for comparison; it cannot read actual csv files.)


Code to reproduce the plot:

import numpy
import pandas
import perfplot

filename = "a.txt"

def setup(n):
    a = numpy.random.rand(n)
    numpy.savetxt(filename, a)
    return None

def numpy_genfromtxt(data):
    return numpy.genfromtxt(filename)

def numpy_loadtxt(data):
    return numpy.loadtxt(filename)

def numpy_fromfile(data):
    out = numpy.fromfile(filename, sep=" ")
    return out

def pandas_readcsv(data):
    return pandas.read_csv(filename, header=None).values.flatten()

def kington(data):
    delimiter = " "
    skiprows = 0
    dtype = float

    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        kington.rowlength = len(line)

    data = numpy.fromiter(iter_func(), dtype=dtype).flatten()
    return data

perfplot.show(
    setup=setup,
    kernels=[numpy_genfromtxt, numpy_loadtxt, numpy_fromfile, pandas_readcsv, kington],
    n_range=[2 ** k for k in range(20)],
)
