Trying to run a big python job with AWS (EC2), is there a better way?

In summary, large Python jobs on AWS EC2 can often be run more effectively via alternatives such as AWS Batch for automatic job scheduling, Lambda for serverless execution of smaller tasks, or AWS Fargate for containerized applications. Spot instances can reduce costs, while AWS Glue or EMR can handle big data processing efficiently. Ultimately, the choice of service depends on the job's specific requirements, budget constraints, and desired scalability.
  • #1
ergospherical
I've got some code in a public repo with a module containing my model and a Python script which runs the model and returns a .parquet file with data. I've parallelized all of the important processes with Joblib, and the underlying code is written in pure C, so I can't make it much faster (at least, to the best of my knowledge; I'm sure someone else could).

That said, it's taking about 3 days to run on my crappy laptop. Did some digging and found that Amazon AWS might be sensible:

https://docs.aws.amazon.com/systems...egration-github-python.html?tag=pfamazon01-20

There are no EC2 free tier offers for this kind of workload, so it's a good idea to make sure it's the right option before proceeding. Are there better alternatives? (I just want the job to run fast on a big compute-optimized cluster; I don't need much memory or storage optimization.)

Has anyone used EC2 before? Is it relatively painless to set up?
 
  • #2
If your code is not running fast enough, profile it: see where it's spending its time and whether that can be sped up. Don't just assume "C is fast". Bad algorithms can be implemented in C too.
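One low-effort way to follow this advice is Python's built-in cProfile. In the sketch below, `slow_part` and `run_model` are placeholders standing in for the actual model code:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Stand-in hot loop; in practice this would be the model's integrator.
    total = 0.0
    for i in range(1, n):
        total += 1.0 / i
    return total

def run_model():
    # Hypothetical driver; substitute the real script's entry point.
    return slow_part(200_000)

profiler = cProfile.Profile()
profiler.enable()
run_model()
profiler.disable()

# Report the most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

The cumulative-time column quickly shows whether the time really is going into the C extension or into Python-side overhead around it.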
 
  • Like
Likes pbuk and ergospherical
  • #3
The integrators in the model are pretty well optimized; the job is just simulating millions of particles over timescales of gigayears, so I reckon it's always going to be a computationally difficult job. I have just gotten a free-tier EC2 instance up and running, but the free tier I've chosen is less powerful than my laptop.
 
  • #4
ergospherical said:
the job is just simulating millions of particles for timescales of Gigayears
That's a lot of particles and years. Do you have any intermediate indicators of how far the calculation has progressed in 3 days? There are a lot of problems that cannot be solved even on supercomputers.

My experience with very long-running programs is to periodically store intermediate results so that the program can be continued from an intermediate point. There are power outages, unplanned system resets, etc.
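A minimal sketch of that checkpoint-and-resume pattern in plain Python; the file name and loop body are placeholders, and a real run would serialize the particle state, ideally to persistent storage:

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical path; use durable storage in the cloud

def load_checkpoint():
    """Resume from the last saved step, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": 0.0}

def save_checkpoint(snapshot):
    """Write atomically: a crash mid-write must not corrupt the old file."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(snapshot, f)
    os.replace(tmp, CHECKPOINT)

snapshot = load_checkpoint()
for step in range(snapshot["step"], 100):
    snapshot["state"] += 1.0          # stand-in for one integration step
    snapshot["step"] = step + 1
    if snapshot["step"] % 10 == 0:    # checkpoint every 10 steps
        save_checkpoint(snapshot)
```

The atomic `os.replace` matters: if the machine dies while a checkpoint is being written, the previous good checkpoint survives.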

I have no experience with massively parallel algorithms, but you might want to look into the possibility of utilizing a GPU for parallelization.
 
  • #5
Ah, if I had a dime for everyone who said "My code doesn't need to be profiled. It's already optimal" I'd be a wealthy man.
 
  • Like
Likes pbuk and FactChecker
  • #6
ergospherical said:
Has anyone used EC2 before? Is it relatively painless to set up?
Yes. I'd say that of all the cloud offerings AWS is in general the most painful to set up, but this is just an inconvenience.

ergospherical said:
I have just gotten a free EC2 instance up and running, but currently I've chosen a free version which is less powerful than my laptop.
The free instances are a waste of time for this - they are intended for developing and staging websites.

To get anything better than an average desktop (top gaming-machine performance) you need at least, say, a c8g.8xlarge (32 vCPU/64 GB RAM), but bigger is always better, so a c8g.48xlarge will give you 192 vCPU/384 GB. At "on demand" prices they will cost you $1.27 and $7.63 an hour respectively, so don't do that: "spot" prices (where your job is fitted around other workloads) are generally around 15% of the cost.
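The arithmetic for a roughly one-day run at those quoted rates, with spot assumed at ~15% of on-demand:

```python
# On-demand prices (USD/hour) quoted above; spot assumed at ~15% of on-demand.
on_demand = {"c8g.8xlarge": 1.27, "c8g.48xlarge": 7.63}
spot_fraction = 0.15
job_hours = 24  # roughly a one-day run

costs = {}
for instance, rate in on_demand.items():
    od_cost = rate * job_hours
    costs[instance] = (od_cost, od_cost * spot_fraction)
    print(f"{instance}: on-demand ${od_cost:.2f}, spot ~${costs[instance][1]:.2f}")
```

So even the 192-vCPU machine comes out cheaper on spot for a day than the 32-vCPU machine does on demand, at these assumed rates.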

Pricing on Azure tends to be cheaper for committed resources (which you don't need) but more expensive for spot, though this is a generalization. There are other providers with various levels of price/performance/support: e.g. Google at the top, Vultr in the middle; at the bottom I won't mention any names, but I have not used them for HPC.

ergospherical said:
The integrators in the model are pretty well optimized, the job is just simulating millions of particles for timescales of Gigayears so I reckon it's always going to be a computationally difficult job.
'just' doing galaxy sims? I'm sure I don't need to point out that this is a challenging task and many sub-optimisations need to be considered (e.g. Barnes-Hut, multipole...). Are you using GADGET-2?

Other considerations come to mind: are you sure your model retains meaningful accuracy over this sort of timescale? How are you dealing with tricky things: collisions, stellar evolution, GR in general?

Have you looked at compiling your code for GPU? I would have thought you would want to do that, whether for running on your own hardware or in the cloud.

Finally do you not have access to an academic institution's computing facilities, or have they already pointed you at AWS?
 
  • Like
  • Informative
Likes Vanadium 50, ergospherical, PeterDonis and 1 other person
  • #7
There are several tuning parameters in the code that control the accuracy of the simulations; I've previously turned these way down, and the results you end up with are qualitatively correct, even at the lower level of accuracy. Now it would be good to do a higher-accuracy run, except the computation time scales exponentially with the tuning parameters. The bottleneck in the compute speed is the numerical integration of the dynamics, which is a fully parallelized tree code.

How do you mean compile for GPU -- compile into an MPS runtime and deploy that?

I spun up a 96-vCPU (c5.24xlarge) instance on AWS; it is faster (compute time ~1 day) but also quite expensive. I think this at least demonstrates that the parallelism is working and the code is effectively distributing the work across the cores.

The spot pricing is a bit confusing to me -- when I tried to set it up, AWS gives me a whole load of separate instances. With a single c5.24xlarge instance, I can just connect via SSH, clone the repo and install requirements.txt, then run the script. To use the spot pricing service, I don't know whether I have to do more work to manually distribute the job across all of the provided instances.
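For reference, the single-instance workflow just described can be sketched as a few commands; the key file, host address, and repo URL below are placeholders, and `nohup` keeps the multi-day run alive if the SSH session drops:

```shell
# Connect to the instance (placeholder key and address).
ssh -i my-key.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the instance: clone, install, and launch the job in the background.
git clone https://github.com/yourname/yourmodel.git
cd yourmodel
python3 -m pip install -r requirements.txt
nohup python3 run_model.py > run.log 2>&1 &

# Later: reconnect and check progress.
tail -f run.log
```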
 
  • #8
ergospherical said:
How do you mean compile for GPU -- compile into an MPS runtime and deploy that?
If you want to maximize performance gains you will need to do more than that. There are some good references in https://developer.nvidia.com/gpugem...lation/chapter-31-fast-n-body-simulation-cuda
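For orientation, the baseline that those references accelerate is direct summation over all particle pairs. A minimal plain-Python sketch (G = 1 units, with a softening term to avoid singularities at close encounters):

```python
def accelerations(positions, masses, softening=1e-3):
    """Direct O(N^2) summation of Newtonian gravitational accelerations.
    Tree codes and multipole methods exist precisely to avoid this cost."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + softening**2
            inv_r3 = r2**-1.5
            acc[i][0] += masses[j] * dx * inv_r3
            acc[i][1] += masses[j] * dy * inv_r3
            acc[i][2] += masses[j] * dz * inv_r3
    return acc
```

The inner loop is embarrassingly parallel over `i`, which is why this kernel maps so well onto GPU threads in the CUDA chapter linked above.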

I think there are also some quirks to running the AWS GPU instances, see the SDK documentation for these.

It might also be interesting to compare the performance of a multipole method with your tree method - see for instance https://academic.oup.com/mnras/article/506/2/2871/6312509.

ergospherical said:
The spot pricing is a bit confusing to me -- when I tried to set it up, AWS is gives me a whole load of separate instances. With a single c5.24xlarge instance, I can just connect via SSH, clone the repo & install requirements.txt then run the script. To use the spot pricing service, I don't know if I have to do some more work to manually distribute the job across all of the provided instances
No, the separate instances are independent options. Basically you set up a job (in your case: clone a repo, run a script, write the output somewhere persistent, e.g. S3) and choose from the big list of instances what hardware you want to run it on. The wider your choice of hardware, the sooner you will get an available spot instance. Because your job is long-running you will probably find the instance is terminated partway through, so you will want to save state periodically so that the job can recover.
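A hypothetical user-data bootstrap script for such a spot job might look like the following; the repo URL, script names, and bucket are all placeholders, and the checkpoint copy is what lets an interrupted run resume on the next instance:

```shell
#!/bin/bash
# Hypothetical EC2 user-data script: runs once at boot on the spot instance.
set -euo pipefail
cd /home/ec2-user
git clone https://github.com/yourname/yourmodel.git
cd yourmodel
python3 -m pip install -r requirements.txt
# Pull any previous checkpoint so an interrupted run can resume.
aws s3 cp s3://your-bucket/checkpoint.pkl . || true
python3 run_model.py
# Persist results somewhere durable before the instance goes away.
aws s3 cp output.parquet s3://your-bucket/
sudo shutdown -h now   # stop billing when the job is done
```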
 
  • Like
Likes ergospherical
