Managing a Huge Array on a Unix/Linux System

In summary, a programmer is seeking advice on how to handle a very large amount of data in a C++ program. They are considering using a swap file on disk, but another user suggests using a sparse array implementation. The programmer also asks about memory mapping and is advised to use it to access only parts of the file at a time. A class is suggested to handle this, with possible optimizations for efficiency. The programmer also raises concerns about resizing the file and proper cleanup. Finally, another user suggests finding ways to avoid processing the entire array due to the large size.
  • #1
Hurkyl
Staff Emeritus
Science Advisor
Gold Member
My array is big. No, bigger than that. It's big!

I'm writing a C++ program on a unix box (or is it linux? I never notice the difference), and I want to operate on a very large amount of data: more than 2^32 bytes.

Do I have any options aside from writing my own code to manage a swap file on disk?
 
  • #2
Do you actually have 2^32 non-zero bytes? Or is most of your array zeroes? If so, you might want to implement (or use an existing) sparse array.

- Warren
 
  • #3
The whole thing does get filled. I -might- be able to work with only part of the array at a time, and I -might- be able to identify large portions of the array as irrelevant, but I want to have an approach ready if I'm unable to manage either of these.

(And, of course, there's raw, unadulterated curiosity that needs to be satisfied! :smile:)
 
  • #4
Well, the easiest solution from an application-code perspective is to make a 4+ GB partition on your disk and use it for swap. Then you should be able to call malloc() and have it actually allocate 4GB for you in one block. There's a good chance you'll have to recompile your linux kernel to handle >4 GB of virtual memory address space, since it'll require 64-bit pointers. I think it's possible, but it might get hairy.

You might also want to consider memory-mapping your 4 GB data file (mmap() in sys/mman.h). This will still require 4+ GB of virtual address space, and still might get hairy.

Either of those two solutions would allow you to access elements of your huge array with nary a line of special code.

But... don't do any of that.

The easiest solution is to memory map *part* of the file at a time -- say, 1 GB or 2 GB of it at once, and switch from one segment to the other when necessary. That way you won't have to worry about using 64-bit pointers or any of that hairy stuff I mentioned in the first two paragraphs of this post. Memory-mapping doesn't actually read the file's contents into memory or anything, so there's not much of a performance hit in switching from one part of the file to another. You will, of course, have to abstract your array to automatically map the correct part of the file each time it is accessed.

- Warren
 
  • #5
All right. I've never used any of the memory mapping functions... I've peeked at the man pages, but they weren't too enlightening last time I looked. Do you know of a link (or can you give one yourself) to a brief intro to using them?
 
  • #6
http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

That's the POSIX manpage for mmap().

If I were you, I would make a C++ class that overloads the [] operator to automatically check which segment is currently mapped, and map a new segment if it's not the right one. The operative code might be something like

Code:
// needs <sys/mman.h> (mmap, munmap) and <sys/types.h> (off_t)
class HugeArray {
 public:
   // segmentSize is in number of slots, NOT bytes; segmentSize * sizeof(int)
   // should be a multiple of the page size, since mmap offsets must be
   // page-aligned
   HugeArray(int fileDescriptor, size_t segmentSize) {
     this->segmentSize = segmentSize;
     this->fileDescriptor = fileDescriptor;
     currentSegment = -1;
     segment = 0;
   }

   ~HugeArray() {
     // mmap'd memory is released with munmap(), not delete
     if (segment)
       munmap(segment, segmentSize * sizeof(int));
   }

   int& operator[] (size_t i);
 
 private:
   size_t segmentSize;
   void* segment;
   long currentSegment;
   int fileDescriptor;
 };

int& HugeArray::operator[](size_t i) {
  long neededSegment = (long) (i / segmentSize);
  if (neededSegment != currentSegment) {
    if (segment)
      munmap(segment, segmentSize * sizeof(int));
    // compute the file offset in off_t so it can't overflow a 32-bit int
    segment = mmap(0, segmentSize * sizeof(int),
                   PROT_READ | PROT_WRITE, MAP_PRIVATE, fileDescriptor,
                   (off_t) neededSegment * (off_t) (segmentSize * sizeof(int)));
    currentSegment = neededSegment;
  }
  return ((int*) segment)[i - (size_t) currentSegment * segmentSize];
}

I'm not on a 'nix box where I can check this for syntax and functionality right now, but I will be glad to help you later if you need more help.

- Warren
 
  • #7
BTW, if you're doing a whoooole lot of array accesses, there are a few optimizations you can make here and there to this code... though a good compiler may already do them for you.

- Warren
 
  • #8
Oh, and you should probably use MAP_SHARED instead of MAP_PRIVATE if you're going to be storing changes to the array back to disk.

- Warren
 
  • #9
I'm generating the array in code rather than reading it from disk... will mmap resize the file as necessary, or will I need to do it myself?


My program will need the array until it's finished, but for future reference, is there any additional cleanup code that would be necessary, or is deleting the mapped memory and closing the FD sufficient?
 
  • #10
Actually, I don't think mmap will resize the file for you -- you'll probably need to extend it to its full size first (ftruncate() should do it). As for cleanup, munmap()ing the last segment and closing the file descriptor should be sufficient. Don't quote me on that, though -- you might want to run a couple of tests first.

- Warren
 
  • #11
Hurkyl, you should seriously consider identifying irrelevant parts of that array. Even the most trivial processing (say, find the sum modulo 2^32) of an array with the size 2^32 will take a LOT of time.
 
  • #12
rgrig is right. And with disk accesses it will be even slower.
 
  • #13
I might have exaggerated a bit. The code:
Code:
unsigned i;
for (i = 0; i != unsigned(-1); ++i);
runs in about 4.5 seconds on a Pentium 4 at 2.66GHz, when compiled with gcc version 3.3 with the flag -O2 (i.e. optimizations turned on).
 
  • #14
Even so, the first thing anyone should ever do when facing an array of this size is to find ways to avoid processing the entire array. :smile:

- Warren
 

FAQ: Managing a Huge Array on a Unix/Linux System

What is a huge array and why is it important to manage it?

A huge array is a large collection of data stored in a contiguous block of memory. It is important to manage it in order to efficiently access and manipulate the data, optimize memory usage, and prevent potential errors or crashes.

How do you create a huge array on a Unix/Linux system?

To create a huge array on a Unix/Linux system, you can use the mmap system call or the malloc function. These methods allocate the required memory space and return a pointer to the beginning of the array.

What are some common challenges when managing a huge array on a Unix/Linux system?

Some common challenges include efficient memory usage, avoiding memory leaks or segmentation faults, and handling concurrent access to the array by multiple processes.

How can you optimize memory usage when working with a huge array?

One way to optimize memory usage is to use virtual memory techniques such as mapping only the necessary parts of the array into memory. Additionally, you can use data compression techniques to reduce the size of the array and therefore save memory space.

What are some best practices for managing a huge array on a Unix/Linux system?

Some best practices include using appropriate data structures for the type of data stored in the array, regularly checking for errors and handling them properly, and using efficient algorithms for data manipulation. It is also important to regularly monitor memory usage and make adjustments as needed.
