galliaproject
Hi all,
I'm looking for a good example of a large dataset of non-image-centric physics data (e.g. astronomy, particles, ...) so I can add an example to this section of my documentation (formal announcement for the Gallia library: see Scala users forum).
I looked around, for instance on the Awesome public datasets page, but in most cases it doesn't link directly to the data, and I've gotten lost going down rabbit holes too many times already. I think I'm just not familiar enough with the domain. For reference, a great counterpart in bioinformatics for what I need would be the ICGC data for skin cancer (it adds up to ~40GB once uncompressed).
It'd be nice if the dataset were similarly large-ish (despite having no images) in terms of "rows", as in: too big to fit in the memory of a typical consumer-grade computer. It could be in pretty much any format among json/tsv/csv/avro/parquet and the like (see Gallia's input section), and it doesn't have to be all in one file either. It can't, however, have millions or billions of columns: a single "row" has to fit in memory at this time, as Gallia is not a particularly column-focused data processing tool.
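To make the shape constraint concrete, here is a minimal Python sketch of how a candidate file could be checked. It is not Gallia code; `profile_rows` and the sample columns (`id`, `ra`, `dec`) are made up for illustration. It streams a delimited file one row at a time, so the file as a whole never has to fit in memory, and reports the row count, the column count, and the size of the widest row.

```python
import csv
import io


def profile_rows(lines, delimiter=","):
    """Stream delimited text row by row and summarise its shape.

    `lines` is any iterable of text lines (an open file works), so only
    one row is held in memory at a time.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        # Empty input: nothing to profile.
        return {"rows": 0, "columns": 0, "max_row_chars": 0}

    rows = 0
    max_row_chars = 0
    for row in reader:
        rows += 1
        # Total characters across the fields of this row (delimiters excluded).
        max_row_chars = max(max_row_chars, sum(len(field) for field in row))

    return {"rows": rows, "columns": len(header), "max_row_chars": max_row_chars}


# Tiny in-memory stand-in for a real file; the values are invented.
sample = io.StringIO("id,ra,dec\n1,10.5,-3.2\n2,11.0,4.75\n")
summary = profile_rows(sample)
print(summary)
```

For a real file, the same function could be called as `profile_rows(open(path, newline=""), delimiter="\t")` for TSV input. A good candidate would report a very large `rows` value with modest `columns` and `max_row_chars`.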
Any pointers would be greatly appreciated!
Thanks,
Anthony