How to store a large data-set in a "grid" in Python/NumPy

  • #1
ergospherical
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.

Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
 
  • #2
An array of one thousand by one thousand is already a million-element array, and each additional dimension multiplies that count again, so running out of memory is likely.
ergospherical said:
What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
That will depend on how sparse the data is in the grid. Maybe you do not need to store all that zero data.

Can you arrange the records that will be needed together so that they can be read together from storage?
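
For what it's worth, here is a minimal sketch of the "skip the zero data" idea above using scipy.sparse. The shape, the handful of non-zero entries and the filename are invented for illustration, and scipy.sparse only handles 2D matrices, so an n-dimensional grid would have to be flattened or stored as coordinate/value pairs instead.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 grid with only a couple of non-zero values.
dense = np.zeros((1000, 1000))
dense[3, 7] = 1.5
dense[500, 2] = -0.25

s = sparse.coo_matrix(dense)            # keeps only the non-zero entries
sparse.save_npz("grid_sparse.npz", s)   # compact on-disk representation

s2 = sparse.load_npz("grid_sparse.npz")
print(s2.nnz, s2.shape)                 # -> 2 (1000, 1000)
```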
 
  • #3
ergospherical said:
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.
Unless the values of each ## X_i ## are restricted to a finite set of integers I don't see the analogy to a grid or an nD matrix. Instead I see a 2D matrix with k rows (k is the number of observations) and n + 1 columns [X0, X1, ..., Xn-1, f(X)] (note I have renumbered the Xi to start at zero, as is the convention for almost all computing apart from Fortran).

ergospherical said:
Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy?

To generate the grid, and then be able to read sections of the grid efficiently?
Assuming it will all fit in memory, almost certainly a numpy.array. This can be stored on disk as a CSV file, which is probably better for portability than anything else.

If ##8k(n + 1)## bytes (8 bytes per 64-bit float) exceeds the available memory then you could try an SQL database (with indexes on all Xi columns), but if this is too slow it's going to be difficult.
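
To make that layout concrete, here is a minimal sketch with placeholder sizes (k, n and the filename are invented for illustration): one row per observation, the Xi in the first n columns and f(X) in the last, saved to and reloaded from CSV.

```python
import numpy as np

k, n = 100_000, 4                      # placeholder sizes
rng = np.random.default_rng(0)

X = rng.random((k, n))                 # the parameter values
f = X.sum(axis=1)                      # stand-in for f(X)
data = np.column_stack([X, f])         # shape (k, n + 1)

print(data.nbytes)                     # 8 * k * (n + 1) bytes for float64

np.savetxt("grid.csv", data, delimiter=",")    # portable, but large on disk
data2 = np.loadtxt("grid.csv", delimiter=",")  # read it all back
```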
 
  • #4
Thousands of values in each direction is, by modern standards, not large at all. A NumPy array can easily handle that.
If you want a more efficient file format than CSV you can have a look at HDF5: not quite as portable, but widely supported and much more efficient.
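
As a rough sketch of the HDF5 route via the h5py package (the dataset name, shape and filename here are made up):

```python
import numpy as np
import h5py

data = np.random.random((100_000, 5))

with h5py.File("grid.h5", "w") as fh:
    fh.create_dataset("grid", data=data, compression="gzip")

# Later: read back only a section, without loading the whole file.
with h5py.File("grid.h5", "r") as fh:
    section = fh["grid"][1000:2000, :]   # only these rows are read from disk
```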
 
  • #5
Is there a good way of reducing the RAM usage if I'm writing to a large numpy array? I've followed @pbuk's suggestion so far, writing to a 2D numpy array of size ##k \times (n+1)##. But I'm also using Google Colab, and there's only 12.7 GB of system RAM, which is quickly used up, causing it to crash. Is there a way to write each row to a CSV file sequentially, instead of waiting for the entire numpy array to populate first?
 
  • #6
Where is the data coming from? Can you create it a row at a time and append to a file?
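
A sketch of what that might look like: generate one row at a time and append it to the file immediately, so the full array never has to sit in RAM. Here generate_row() is a hypothetical stand-in for however the data is actually produced.

```python
import numpy as np

def generate_row(i, n=4):
    """Hypothetical stand-in: returns [X0, ..., Xn-1, f(X)] for grid point i."""
    X = np.random.random(n)
    return np.append(X, X.sum())

with open("grid.csv", "a") as fh:        # append mode: safe to stop and resume
    for i in range(1_000_000):
        row = generate_row(i)
        fh.write(",".join(f"{v:.8g}" for v in row) + "\n")
```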
 
  • #7
Yeah, that works actually.
 
  • #8
Of course that does indicate that you are going to have problems reading it all back in again, at least in Google Colab. Options include processing in batches or transferring to a local workstation with a bit more oomph.
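
One possible shape for the batch option (the batch size and filename are arbitrary): read the CSV back a block of rows at a time rather than all at once. Note that skiprows makes each batch re-scan the lines before it, which is part of why later posts move to HDF5.

```python
import numpy as np

batch = 100_000
start = 0
while True:
    block = np.loadtxt("grid.csv", delimiter=",",
                       skiprows=start, max_rows=batch, ndmin=2)
    if block.shape[0] == 0:
        break
    # ... process `block` (at most batch rows of n + 1 columns) ...
    start += batch
```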
 
  • #9
Yeah, I think I’m going to give up on Colab. I was only using it because it’s convenient to share progress with the rest of the team but that’s less important…
 
  • #10
If your data can use sparse arrays, the size problem will disappear.

Alternatively, keep in memory only the rows and columns of the disk file that are needed at the time. Do that with two smaller arrays in memory, and keep them smaller than the CPU data cache.
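
A sketch of that second suggestion using np.memmap, which maps the file on disk and only pulls the slices you actually touch into memory (the shape, dtype and filename are invented for illustration):

```python
import numpy as np

shape = (1_000_000, 5)

# Create the file-backed array and fill one slice of it.
grid = np.memmap("grid.dat", dtype="float64", mode="w+", shape=shape)
grid[0:1000, :] = np.random.random((1000, 5))   # only this slice touches RAM
grid.flush()

# Later: reopen read-only and copy out just the rows needed right now.
view = np.memmap("grid.dat", dtype="float64", mode="r", shape=shape)
rows = np.array(view[500:600, :])               # small in-memory working copy
```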
 
  • #11
Is there a reason why you need to use CSV? It is very, very inefficient.
There are many file formats that were developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by MATLAB and Python), but there are others.
CSV is very convenient for small datasets, but it should in my view not be used for anything larger than a few hundred rows/columns.
 
  • #12
Yes, I'm writing to HDF5 now.
 
  • #13
f95toli said:
Is there a reason why you need to use CSV? It is very, very inefficient.
There are many file formats that were developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by MATLAB and Python), but there are others.
CSV is very convenient for small datasets, but it should in my view not be used for anything larger than a few hundred rows/columns.
CSV has a significant advantage over more structured data formats: it doesn't break when you crash, because a partially written plain-text file is still readable. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
 
  • #14
pbuk said:
If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have certainly used CSV, reading the entire file into memory, for datasets with much more than a few hundred rows (I don't know that my column count has ever gotten that high). However, the computations I was doing on the data were fairly simple and didn't require a lot of additional objects that would take up much extra memory. The latter might not be the case for the kinds of computations the OP is doing.
 
  • #15
pbuk said:
CSV has a significant advantage over more structured data formats: it doesn't break when you crash, because a partially written plain-text file is still readable. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have also used CSV for larger datasets. However, these days we try to avoid using it as a format.
There are a couple of reasons. Firstly, if you are working with a dataset consisting of hundreds of rows and columns you are probably doing something that involves processing a relatively large amount of data, and even if CSV works fine for the dataset you are using initially it might not work if you suddenly find that you need to scale to thousands of rows and columns. It is often better to just use a more flexible format to start with. Once you are used to it, HDF5 (and similar formats) is not harder to use than CSV.
Secondly, formats such as HDF5 are much more flexible when it comes to structuring the data. This is especially important if you, for example, find that you suddenly need to add another "dimension" and run calculations (or, in our case, measurements) against another variable. This can be done with CSV as well, simply by saving multiple files, but processing thousands of CSV files, even if each one is relatively small (say a few MB in size), gets very slow. Moreover, HDF5 is much better for metadata, so used right it is much easier to understand the content of an old file even if the documentation has been lost.
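
To illustrate those two points with h5py (all names, shapes and attribute values here are invented): attributes make the file self-describing, and a resizable, chunked dataset lets you append another sweep later instead of scattering results over many CSV files.

```python
import numpy as np
import h5py

with h5py.File("results.h5", "w") as fh:
    # Resizable along the first axis; chunked storage is required for that.
    dset = fh.create_dataset("f_of_X", shape=(0, 5), maxshape=(None, 5),
                             dtype="float64", chunks=True)
    # Metadata travels with the data, so the file stays understandable later.
    dset.attrs["columns"] = "X0,X1,X2,X3,f"
    dset.attrs["created_by"] = "sweep script v1"

    new_rows = np.random.random((1000, 5))      # e.g. one new sweep
    dset.resize(dset.shape[0] + new_rows.shape[0], axis=0)
    dset[-new_rows.shape[0]:, :] = new_rows
```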
 
