Find Experimental Datasets for Python Modeling

  • Python
  • Thread starter Taylor_1989
  • Start date
  • Tags
    Experimental
  • #1
Taylor_1989
I am currently looking to improve my Python skills and am looking for some projects to do. One that came to mind was modeling experiments in Python: what I'd like to do is code a model of an experiment, say the period of a pendulum, and then compare the model to data obtained in a lab. The issue is that I don't have access to a lab, nor am I in education anymore, so I was wondering if anyone knew of any sites like Kaggle that deal with raw experimental data sets?

I did manage to find a couple of results on Google that directed me to a site called Pizza and Chili, but it seems to be a bit of a dead loss. The same goes for Kaggle: there were only a few, and even those were not really what I was looking for. If anyone knows of any, could they please post a link?

Thanks in advance.
 
  • #2
NOAA has data available for greenhouse gas concentrations as well as tides and lots of other things. I've mentored excellent student projects using both. See the citations in these two papers for some links:

https://arxiv.org/ftp/arxiv/papers/1812/1812.10402.pdf

https://arxiv.org/pdf/1507.01832.pdf

Unfortunately, I don't know of any central repository, though data in various fields is available through various sources.

Our lab has a number of raw data sets from videos of undergraduate-level experiments in kinematics and mechanics. Stuff like this experiment: https://www.physicsforums.com/insights/an-accurate-simple-harmonic-oscillator-laboratory/

We don't intend to publish these data sets or all the videos, but we're willing to send them privately, especially if you're willing to analyze the raw videos using something like Tracker. Send me a PM if interested.
 
  • #3
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
 
  • #4
You can produce your own random data, since it is just for practice purposes. That way you could also play with the distribution function.
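For instance, here is a minimal sketch of that idea using NumPy; the underlying model and noise scales are just illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the "experiment" is reproducible

x = np.linspace(0.0, 10.0, 50)
y_model = 2.0 * x + 1.0  # some underlying "true" model

# The same model with two different noise distributions to play with:
y_uniform = y_model + rng.uniform(-0.5, 0.5, size=x.size)  # flat noise
y_normal = y_model + rng.normal(0.0, 0.5, size=x.size)     # Gaussian noise
```

Fitting the model back to each noisy set is then a self-contained practice exercise.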
 
  • #5
Yes, use the actual ideal solution to generate one test set, then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance in meters measured to an accuracy of about ±1 cm would mean adding a delta in the range -1.5 cm < delta_cm < 1.5 cm to each value:

delta_cm = random.random()*3.0 - 1.5

where random.random() returns a random value between 0 and 1.

https://pythonprogramminglanguage.com/randon-numbers/
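Put together for the pendulum example from the first post, that might look something like this (a sketch; g, the lengths, and the ±1.5 cm error range are illustrative):

```python
import math
import random

random.seed(0)  # reproducible "measurements"

g = 9.81  # m/s^2

def ideal_period(length_m):
    # Small-angle pendulum: T = 2*pi*sqrt(L/g)
    return 2.0 * math.pi * math.sqrt(length_m / g)

lengths = [0.2, 0.4, 0.6, 0.8, 1.0]

# Test set 1: the ideal model.
ideal = [(L, ideal_period(L)) for L in lengths]

# Test set 2: same model, but each length carries a simulated
# measurement error of up to +/- 1.5 cm.
noisy = [(L + (random.random() * 3.0 - 1.5) / 100.0, ideal_period(L))
         for L in lengths]
```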
 
  • #6
fresh_42 said:
You can produce your own random data, since it is just for practice purposes. That way you could also play with the distribution function.
jedishrfu said:
Yes, use the actual ideal solution to generate one test set, then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance in meters measured to an accuracy of about ±1 cm would mean adding a delta in the range -1.5 cm < delta_cm < 1.5 cm to each value:

delta_cm = random.random()*3.0 - 1.5

where random.random() returns a random value between 0 and 1.

https://pythonprogramminglanguage.com/randon-numbers/

I've mentored a couple of student projects using the approach of generating data with a predictive model plus normally distributed random noise. Compared with using real experimental data, the student's learning process was somewhat limited, so I always worked out how the student could also find and use real experimental data. Even with added noise, computer-generated data does not provide a fully authentic learning experience, since real experimental data often has imperfections other than the experimental uncertainties on the dependent variable.

Some of these can be simulated with additional effort. For example, jitter (error) can be added to the independent variable as well, using the same method used to generate a normally distributed, appropriately scaled error for the dependent variable. But even this approach still assumes a data set in which the values of the independent variable are equally spaced, or nearly so, over an interval. Another approach is to generate random numbers for the independent variable in a given interval representing the anticipated measurement range of a proposed experiment. The point is that real experiments always have uncertainties in both variables, and often the independent variable is not controlled as well as it is measured.
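Both ideas above (noise on the independent variable, and randomly placed rather than evenly spaced sample points) can be sketched with NumPy; the pendulum model and the noise scales here are just assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent variable: random points within the anticipated measurement
# range, rather than an evenly spaced grid.
x_true = rng.uniform(0.1, 2.0, size=50)            # pendulum lengths, m
x_meas = x_true + rng.normal(0.0, 0.01, size=50)   # jitter on x itself

# Dependent variable: model prediction plus normally distributed noise.
y_true = 2.0 * np.pi * np.sqrt(x_true / 9.81)
y_meas = y_true + rng.normal(0.0, 0.02, size=50)
```

A fit of (x_meas, y_meas) against the ideal model then exhibits uncertainty in both variables, as real experiments do.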

The learning process when we train aspiring scientists can and should include handling a wide variety of imperfections in experimental data, because that is what experimentalists tend to provide.
 
  • #7
I agree with you @Dr. Courtney. It can be very difficult to generate realistic-looking data.

One time I needed to create a dummy database of customer transaction data, like when they clicked on a link and when they bought something, and further to make it appear as multiple customers shopping in real time.

After every attempt, some pattern appeared in the data that could be traced back to the generating program, prompting us to try again. It tested the system we developed, but it was real customer activity that found the race-condition bugs we were looking for.

In any event, it was a fun coding exercise using AWK, weighted arrays plus random indices, and SQL to load the database tables. A weighted array is an n-element array in which the values are repeated in proportion to their desired frequency, i.e.

1111112222222222222233334555666

so that randomly selecting an element will most often return a 2, since it's the most common value in the array.
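The original was in AWK, but in Python the same idea might look like this (the weights below simply mirror the example string above):

```python
import random

random.seed(3)

# Weighted array: each value appears as many times as its desired weight.
weighted = [1] * 6 + [2] * 14 + [3] * 4 + [4] * 1 + [5] * 3 + [6] * 3

# Picking random elements returns 2 most often, since it dominates the array.
draws = [random.choice(weighted) for _ in range(1000)]
counts = {v: draws.count(v) for v in sorted(set(draws))}
```

Python's standard library also offers random.choices(values, weights=...), which achieves the same effect without building the repeated list.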
 
  • #8
WWGD said:
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
I did check the Data.gov site but didn't really find anything I was specifically looking for; the UC Irvine database, though, is very interesting. Thanks for the link.
 

FAQ: Find Experimental Datasets for Python Modeling

1. What is the purpose of finding experimental datasets for Python modeling?

The purpose of finding experimental datasets for Python modeling is to have real-world data that can be used to train and test machine learning models. This allows scientists and researchers to analyze and understand complex patterns and relationships in the data, and use these insights to make accurate predictions and decisions.

2. Where can I find experimental datasets for Python modeling?

There are many online sources where you can find experimental datasets for Python modeling, such as data repositories like Kaggle, UCI Machine Learning Repository, and Google Dataset Search. You can also find datasets on various government websites, academic research papers, and data science communities.

3. How do I know if an experimental dataset is suitable for Python modeling?

An experimental dataset is suitable for Python modeling if it is in a structured format, has a sufficient number of data points, and contains relevant features that can be used for the specific modeling task. It is also important to ensure the dataset is reliable and accurately represents the real-world phenomenon being studied.

4. Can I use any dataset for Python modeling?

No, not all datasets are suitable for Python modeling. Some datasets may not have enough data points or may not be relevant to the modeling task at hand. It is important to carefully evaluate the dataset before using it for modeling to ensure it is appropriate for your specific needs.

5. Are there any limitations to using experimental datasets for Python modeling?

Yes, there can be limitations to using experimental datasets for Python modeling. These datasets may not always accurately represent the real-world phenomenon, and there may be biases or errors in the data that can affect the performance of the models. It is important to carefully analyze and preprocess the data before using it for modeling to mitigate these limitations.
