Generating a probability density function

In summary, the original poster asks how to build a probability density function from a data set that contains categorical, continuous, and possibly missing values. They mention that they have asked the same question on Stack Exchange Mathematics. The reply points out that some modeling assumptions are needed to solve this problem.
  • #1
mrb427
I am trying to create a simple implementation of the Bayes decision rule with minimum error criterion and I am running into a problem. Specifically, if I have a data set consisting of a number of feature vectors stored in rows, how can I generate a probability density function from this data?

Also, how can I do this if some of the data is discrete, some is continuous, and some is missing? For example, let us assume each feature vector, x, has three elements.

x = [ a, b, c]

where:

a is categorical data and will be an element of the set {0, 1, 2, 3}
b is continuous data and will be in the range [0,1]
c is also continuous data in the range [0,1], but may be missing for some feature vectors
I want to be able to calculate the likelihood of a feature vector, x, based on the total data set or given that x is from a subset, w, of the total data set.

p(x) = ? and p(x|w) = ?

I have also posted this on Stack Exchange Mathematics, here:
http://math.stackexchange.com/quest...sity-function-from-a-set-of-multivariate-data

I would really appreciate it if someone could help me out or point me in the right direction!
 
  • #2
mrb427 said:
Specifically, if I have a data set consisting of a number of feature vectors stored in rows, how can I generate a probability density function from this data?

When you don't have enough information to solve a problem, a standard technique is to assume a specific model for the data, a model that has a few unknown parameters. Then estimate the parameters from the data.

There's no use pretending that "I don't make any assumptions". Whatever you do, you'll end up having to make assumptions of some sort because even a simple data set does not determine a unique probability density function unless you make assumptions.

Treating situations where data is missing is usually discussed under the heading "missing data" (and, when a value is only known to lie beyond some limit, "censored data"). If you search on those keywords, you might find something that applies to your problem. To get suggestions for a plausible model for your data, I think you have to reveal more details about it.
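To make the "assume a model, then estimate its parameters" idea concrete, here is a minimal sketch for the three-element feature vector x = [a, b, c] from the first post. Everything about the model is an assumption chosen for illustration, not something stated in the thread: the features are treated as independent given the class w, so p(x|w) = P(a|w) p(b|w) p(c|w); a gets a smoothed categorical PMF; b and c get Gaussian densities (which ignores the fact that they are bounded to [0,1]); and a missing c (written as None) is assumed to be missing at random, so its factor is simply left out, which amounts to integrating c out under the independence assumption. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def fit_gaussian(values):
    """Return (mean, std) of the observed values; std is floored to avoid zero."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(math.sqrt(var), 1e-3)

def gaussian_pdf(v, mean, std):
    """Density of a normal distribution with the given mean and std at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_class_model(rows, categories=(0, 1, 2, 3)):
    """Estimate the parameters of p(x|w) from rows [a, b, c]; c may be None."""
    counts = Counter(r[0] for r in rows)
    # Laplace smoothing so categories unseen in this class keep nonzero probability.
    pmf_a = {k: (counts[k] + 1) / (len(rows) + len(categories)) for k in categories}
    gauss_b = fit_gaussian([r[1] for r in rows])
    # Only rows where c was actually observed contribute to its estimate.
    gauss_c = fit_gaussian([r[2] for r in rows if r[2] is not None])
    return pmf_a, gauss_b, gauss_c

def likelihood(x, model):
    """p(x|w) under the assumed independence model; a missing c is skipped."""
    pmf_a, gauss_b, gauss_c = model
    a, b, c = x
    p = pmf_a[a] * gaussian_pdf(b, *gauss_b)
    if c is not None:
        p *= gaussian_pdf(c, *gauss_c)
    return p

# Toy training data (invented), grouped by class label w.
data = {
    "w1": [[0, 0.10, 0.20], [0, 0.20, None], [1, 0.15, 0.30], [0, 0.25, 0.25]],
    "w2": [[3, 0.80, 0.90], [2, 0.90, 0.70], [3, 0.85, None], [2, 0.75, 0.80]],
}
total = sum(len(rows) for rows in data.values())
priors = {w: len(rows) / total for w, rows in data.items()}
models = {w: fit_class_model(rows) for w, rows in data.items()}

def evidence(x):
    """p(x) = sum over w of P(w) p(x|w)."""
    return sum(priors[w] * likelihood(x, models[w]) for w in models)

def classify(x):
    """Minimum-error Bayes decision: pick the w that maximizes P(w) p(x|w)."""
    return max(models, key=lambda w: priors[w] * likelihood(x, models[w]))

print(classify([0, 0.2, None]), classify([3, 0.8, 0.85]))
```

Note that likelihood values for vectors with and without c are not directly comparable to each other, since one is a density over (b, c) and the other over b alone. Comparing classes for the same x, as the decision rule does, is unaffected.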
 

Related to Generating a probability density function

1. What is a probability density function (PDF)?

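A probability density function is a function that describes the relative likelihood of a continuous random variable taking values near a given point. The probability that the variable falls in an interval is the integral of the PDF over that interval, and the PDF integrates to 1 over its whole domain. It is widely used in statistics and probability to analyze and predict outcomes of experiments or events.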

2. How is a PDF different from a probability mass function (PMF)?

While both PDFs and PMFs describe how probability is distributed over outcomes, PDFs are used for continuous random variables and PMFs for discrete random variables. A PMF gives the probability of each specific value directly, whereas a PDF gives a density that must be integrated over a range of values to obtain a probability.

3. How do you generate a PDF for a given data set?

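A common starting point is to plot a histogram of the data to visualize its distribution. From there you can normalize the histogram so that its total area is 1, use a nonparametric method such as kernel density estimation, or assume a parametric family (for example, a normal distribution) and estimate its parameters from the data. The resulting curve is an estimate of the PDF rather than the PDF itself, and different modeling choices give different estimates.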
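As a rough sketch of the histogram approach described above, the following builds a normalized histogram (a piecewise-constant density estimate) for values in [0,1] and, separately, fits a normal distribution by estimating its mean and standard deviation. The sample data is synthetic (drawn from a Beta(2, 5) distribution purely for illustration), and the function names and bin count are assumptions. For skewed data like this, the normal fit can be a poor match, which is exactly the kind of modeling assumption the thread warns about.

```python
import math
import random

def histogram_density(samples, bins=10, lo=0.0, hi=1.0):
    """Return per-bin density heights and the bin width; heights integrate to 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        # Clamp so that a sample exactly equal to hi lands in the last bin.
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(samples)
    # Dividing by n * width turns counts into a density (area under the bars is 1).
    return [c / (n * width) for c in counts], width

def fit_normal(samples):
    """Estimate mean and standard deviation for an assumed normal model."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
    return mean, std

random.seed(0)
samples = [random.betavariate(2, 5) for _ in range(1000)]  # synthetic data in [0, 1]

density, width = histogram_density(samples)
mean, std = fit_normal(samples)

area = sum(d * width for d in density)
print(f"histogram area = {area:.6f}, fitted mean = {mean:.3f}, fitted std = {std:.3f}")
```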

4. What is the importance of generating a PDF?

Generating a PDF allows us to understand the distribution of a data set and make predictions about future outcomes. It also allows us to calculate probabilities for specific ranges of values, which can be useful in decision making and risk analysis.

5. Can a PDF be used to calculate the probability of a specific value?

Not for a continuous random variable. The probability of a range is the integral of the PDF over that range, so the probability of any single exact value is 0, even though the density at that value can be positive (and can even exceed 1). For a discrete random variable, the probability of a specific value is given by its PMF.
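A short numeric illustration of this point, using a normal distribution as an assumed example (the helper names are hypothetical): the probability of an interval comes from the difference of the cumulative distribution function at its endpoints, an interval of zero width has probability 0, and the density itself is not a probability and can be larger than 1.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, std=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def prob_between(a, b, mean=0.0, std=1.0):
    """P(a <= X <= b): the integral of the PDF from a to b."""
    return normal_cdf(b, mean, std) - normal_cdf(a, mean, std)

within_one_std = prob_between(-1.0, 1.0)          # about 0.68 for a standard normal
single_point = prob_between(0.5, 0.5)             # zero-width interval
peak_density = normal_pdf(0.0, mean=0.0, std=0.1) # a density value, not a probability

print(within_one_std, single_point, peak_density)
```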
