Nonlinear Regression: Getting Started with X to Predict Y

In summary, the conversation revolves around finding an equation to accurately predict the Y value for a given X value in a 2D data-set. The discussion includes the use of regression, particularly polynomial regression, and the possibility of using programs like CurveExpert or Excel to obtain the equation. The main goal is to use the equation to find the average height of children between the ages of 4 and 17, as is done with percentile charts in the medical field. The conversation also touches on the importance of understanding the process of curve-fitting and the use of linear algebra in obtaining the unknown coefficients of the polynomial equation.
  • #1
kenewbie
Ok, so I am trying to find an equation to match a 2D data-set (x,y) positions. I have X and I want to use an equation to predict Y to a rather accurate degree.

As far as I can understand, I need to use some form of regression (non-linear, since the data is awfully curved).

Now, I have no knowledge of anything above pre-calculus at the moment, so I gather I will have to do some reading to get where I want.

So, I'm looking for help getting started. General mathematical techniques that would be helpful, algorithms, books on the subject, and so on.

Any and all tips are appreciated.

k
 
  • #2
Well, can you describe the data for us? Or even post it?

1. Simple polynomial regression isn't that hard to do: for someone who knows pre-calc, the toughest part is keeping track of the variables. (There is a step where you'll take a derivative, but just of a polynomial.)
2. You could find a program to do it for you... CurveExpert comes to mind.
3. Excel could be useful as well, either as a computational assistant (making #1 easier), or, with the Analysis ToolPak, to just do the regression.
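To make the derivative step in option 1 concrete, here is a sketch for a quadratic fit. The notation is an assumption, not something from the thread: the data points are (x_i, y_i) for i = 1..n, the unknown coefficients are a, b, c, and S is the sum of squared errors.

[tex]S(a,b,c)=\sum_{i=1}^{n}\left(y_i-a-bx_i-cx_i^2\right)^2[/tex]

Setting the partial derivatives of S with respect to a, b and c to zero gives three equations that are linear in the three unknowns:

[tex]\begin{aligned}
an+b\sum x_i+c\sum x_i^2&=\sum y_i\\
a\sum x_i+b\sum x_i^2+c\sum x_i^3&=\sum x_iy_i\\
a\sum x_i^2+b\sum x_i^3+c\sum x_i^4&=\sum x_i^2y_i
\end{aligned}[/tex]

Every sum is just a number computed from the data, which is why keeping track of the variables is the hard part rather than the calculus. A higher degree adds one coefficient and one equation per extra power of x.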
 
  • #3
I can give an example of the data:

4 104.3
5 112.0
6 119.0
7 124.5
8 130.0
9 135.0
10 140.0
11 145.2
12 151.5
13 157.5
14 164.5
15 172.0
16 175.3
17 178.0

This is age in years and avg. height in centimeters of Norwegian boys aged 4 - 17.

However, I am not looking to do a one-time fitting to a certain set of data. I want to do on-the-fly fitting to any set of similar data. And yes, I have tried a couple of programs which get decent results (a 5th degree polynomial which matches the example data-set fairly well, the errors are < 1 in all cases and often 0 or 0.1).

I want to make the program that makes the fitting, so to speak.

I have been looking at linear regression, which looks easy enough and is fairly understandable, but it is not accurate enough (seeing as the data I am working with is mostly related to human growth, the curves tend to look cubic at the very least, with spurts at childhood and puberty).

Do you happen to know any good sources of information on polynomial regression?

k
 
  • #4
A 5th-degree polynomial fit to the above data would be
[tex]y=a+bx+cx^2+dx^3+ex^4+fx^5[/tex] for
a = 72.736866
b = 4.4251962
c = 1.9980728
d = -0.37948014
e = 0.026680369
f = -0.00064201794

which does fit pretty well with the data points: 15 is the worst fit, off by 0.903 cm. But extrapolation with this model is nearly impossible, as it would have 18 year olds shrink to shorter than 16 year olds and 21 year olds to less than 4 year olds!
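The extrapolation problem is easy to check numerically. The sketch below evaluates the posted polynomial with Horner's rule; the coefficients are copied from this post, while the names `COEFFS` and `predict_height` are made up for illustration.

```python
# Coefficients a..f as posted in this thread, lowest power first.
COEFFS = [72.736866, 4.4251962, 1.9980728, -0.37948014,
          0.026680369, -0.00064201794]


def predict_height(age, coeffs=COEFFS):
    """Evaluate a + b*x + c*x^2 + ... with Horner's rule (no explicit powers)."""
    result = 0.0
    for coefficient in reversed(coeffs):
        result = result * age + coefficient
    return result


if __name__ == "__main__":
    # Ages 4..17 are inside the data; 18 and 21 are extrapolations.
    for age in (4, 15, 16, 17, 18, 21):
        print(age, round(predict_height(age), 1))
```

Inside the range 4 to 17 the values track the table; outside it the leading negative x^5 term takes over, which is the shrinking effect described above.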

So an important question: what do you plan to do with the results of the equation?
 
  • #5
I have found a 9th-degree polynomial that matches very well. Worst fit is off by 0.4, which is within my margin of error. I have a feeling I can get a match that is just as good using methods that require fewer CPU cycles, but speed is not an issue (within reason).

I simply want to use the equation to find the avg height given an age between 4 and 17. I don't want to store the data itself, since the age can be float. (What do we call float in math again? The reals?) I have the "end points" on every data-set, so I don't need to extrapolate anything.

I'm not sure where you are located, but where I am from (Norway) doctors use something we call "Percentile Charts" (loosely translated) to track the development of children. The charts have data for length, weight, circumference of head and so on, and quickly tell the doctor what percentile the kid is in. I know many countries use these.

An example chart: http://www.cdc.gov/nchs/data/nhanes/growthcharts/set1clinical/cj41l020.pdf

You can find more of them here: http://kidshealth.org/parent/growth/growth/growth_charts.html

If you look carefully at them, you can see that, e.g., the line for the 75th percentile is not parallel to the 50th percentile line, nor even equidistant from it in both directions. So, I need all the data on every single "percentile-line", from every chart in every category.

Now, if I want a better "resolution" than year by year, that is quite a lot of data points to store, seeing as I have to make this in JavaScript. So, rather than fill a file with endless arrays and work from those, I want to model the data.

You punch in 11.5 kg and 80 cm for a girl, and the program tells you that this is in the 85th percentile by plugging the numbers into the appropriate equation. I don't need to accept data that would be outside the percentile charts, so I never have to worry about extrapolation.

Now, I CAN do this by using something like CurveExpert, as you suggested. But, while working on this I have come to find the concept of curve-fitting to be very intriguing, so I want to know what I am doing. I want to learn how to fit a polynomial to a set of data. I don't expect to get better results than a program does by doing it myself, but I want to understand the process and be able to do it myself.

This became a lot longer than I intended.

k

That's about as much information as I can give :)
 
  • #6
Kenewbie,

If you can understand just the simplest row-reduction process of linear algebra, this would seem to be what you are interested in knowing. Then, you can just use CurveExpert to obtain the unknown coefficients. One would really not want to perform the row-operations manually, or even manually with the aid of an electronic calculator, since this would be far too time-consuming.

Your polynomial equations, although not really linear equations, can ultimately be used as linear equations. Your unknown will be the coefficients; and the known pieces will be the (x, y) points. Do you see this? If you do not see this, then tell us. As I said, the actual unknowns are the coefficients for the polynomial equation form.

You may want to decide what degree of polynomial to choose, depending on how well each fit represents your set of data. Your description suggests that maybe each group of data will need its own polynomial equation fit.
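A minimal sketch of that idea, in Python for brevity even though the poster is targeting a browser: treat the coefficients as the unknowns, build the least-squares (normal) equations from the known points, solve them by row reduction, and compare degrees by their worst error. All function names are hypothetical. Rescaling the ages to the interval [-1, 1] before fitting is an added assumption that keeps the arithmetic well behaved at higher degrees; it is not something the thread prescribes.

```python
AGES = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
HEIGHTS = [104.3, 112.0, 119.0, 124.5, 130.0, 135.0, 140.0,
           145.2, 151.5, 157.5, 164.5, 172.0, 175.3, 178.0]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns a function x -> predicted y."""
    mid = (max(xs) + min(xs)) / 2.0
    half = (max(xs) - min(xs)) / 2.0
    ts = [(x - mid) / half for x in xs]  # rescaled inputs in [-1, 1]
    size = degree + 1
    normal = [[sum(t ** (i + j) for t in ts) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(size)]
    coeffs = solve(normal, rhs)  # coefficients in the rescaled variable

    def predict(x):
        t = (x - mid) / half
        result = 0.0
        for coefficient in reversed(coeffs):
            result = result * t + coefficient
        return result

    return predict


def max_error(model, xs, ys):
    """Largest absolute miss of the model over the data points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


if __name__ == "__main__":
    for degree in range(1, 8):
        model = fit_polynomial(AGES, HEIGHTS, degree)
        print(degree, round(max_error(model, AGES, HEIGHTS), 3))
```

Picking the lowest degree whose worst error stays inside the required margin is one reasonable selection rule; it says nothing about how the curve behaves between the data points, which still needs a visual check.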
 
  • #7
symbolipoint said:
If you can understand just the simplest row-reduction process of linear algebra, this would seem to be what you are interested in knowing. Then, you can just use CurveExpert to obtain the unknown coefficients. One would really not want to perform the row-operations manually, or even manually with the aid of an electronic calculator, since this would be far too time-consuming.

Linear algebra is still a bit above my competence-level, I am afraid. I understand that the fitting is tedious to do by hand (with a calculator) but I can make a program to do it for me as long as I understand how it works. Will I be able to pick up and understand the bits of linear algebra that I need for this exact problem with just precalc as a basis?

symbolipoint said:
Your polynomial equations, although not really linear equations, can ultimately be used as linear equations. Your unknown will be the coefficients; and the known pieces will be the (x, y) points. Do you see this?

I think so. Even though my points are not linear, I build the coefficients for my polynomial by finding the values that would alter the points to MAKE them linear, so to speak. That was clever.

symbolipoint said:
You may want to decide what degree of polynomial to choose, depending on how well each fit represents your set of data. Your description suggests that maybe each group of data will need its own polynomial equation fit.

I am 100% certain that they will need individual equations. My acceptable margin of error is +/- 0.5, which means they need to be quite specific.

k
 
  • #8
Be careful of reading too much into curve fitting. For example, if I had 17 data points, I could fit them all *perfectly* (no errors anywhere) with a 16th degree polynomial. That's just an artifact of the fact that I have 17 equations and 17 unknowns (a 16th degree polynomial has 17 coefficients), but it doesn't give any insight into the underlying causes, and is pretty useless for extrapolation.
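This is easy to try on the example data from post #3. With 14 points, a polynomial with 14 coefficients (degree 13) passes through every one of them. The sketch below uses the Lagrange form of that polynomial, which is a choice made here to avoid solving a badly conditioned system; the function name `lagrange` is made up for illustration.

```python
AGES = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
HEIGHTS = [104.3, 112.0, 119.0, 124.5, 130.0, 135.0, 140.0,
           145.2, 151.5, 157.5, 164.5, 172.0, 175.3, 178.0]


def lagrange(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through all points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            # Basis polynomial: 1 at xi, 0 at every other data x.
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


exact = lagrange(AGES, HEIGHTS)

if __name__ == "__main__":
    # Zero error at the data points...
    print(max(abs(exact(x) - y) for x, y in zip(AGES, HEIGHTS)))
    # ...but inspect values between and beyond the points before trusting it.
    for age in (4.5, 10.5, 16.5, 18):
        print(age, exact(age))
```

A perfect score at the data points is guaranteed by construction, so it says nothing about whether the curve is sensible anywhere else.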
 
  • #9
maze said:
Be careful of reading too much into curve fitting. For example, if I had 17 data points, I could fit them all *perfectly* (no errors anywhere) with a 16th degree polynomial. That's just an artifact of the fact that I have 17 equations and 17 unknowns (a 16th degree polynomial has 17 coefficients), but it doesn't give any insight into the underlying causes, and is pretty useless for extrapolation.

Good point, thank you. I sort of thought of something similar. I will probably want to use the lowest possible degree that is within my error margin, since more degrees means that the equation will be more erratic between my data-points, right?

I thought I would get some plotter program and "zoom in" on the equations before I actually started using them, to make sure they are somewhat smooth between my points.

k
 
  • #10
It sounds to me like you need polynomial (probably quadratic or cubic) interpolation, or else a spline fit. That way you won't have the catastrophic fluctuations of the other models.
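As one possible reading of the spline suggestion, here is a sketch of a natural cubic spline through the example data from post #3. "Natural" means the second derivative is set to zero at both ends, which is an assumption made here; other end conditions exist. The function name is hypothetical, and the sketch refuses to extrapolate, matching the stated use case.

```python
from bisect import bisect_right

AGES = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
HEIGHTS = [104.3, 112.0, 119.0, 124.5, 130.0, 135.0, 140.0,
           145.2, 151.5, 157.5, 164.5, 172.0, 175.3, 178.0]


def natural_cubic_spline(xs, ys):
    """Return a function interpolating (xs, ys); xs must be strictly increasing."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]  # interval widths
    diag = [0.0] * n
    rhs = [0.0] * n
    # Tridiagonal system for the second derivatives m[1..n-2] at the knots.
    for i in range(1, n - 1):
        diag[i] = 2.0 * (h[i - 1] + h[i])
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i]
                        - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward elimination (Thomas algorithm).
    for i in range(2, n - 1):
        w = h[i - 1] / diag[i - 1]
        diag[i] -= w * h[i - 1]
        rhs[i] -= w * rhs[i - 1]
    # Back substitution; m[0] and m[n-1] stay 0 (the "natural" condition).
    m = [0.0] * n
    for i in range(n - 2, 0, -1):
        m[i] = (rhs[i] - h[i] * m[i + 1]) / diag[i]

    def predict(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the data range; no extrapolation")
        i = min(bisect_right(xs, x) - 1, n - 2)  # interval containing x
        left, right = x - xs[i], xs[i + 1] - x
        return ((m[i] * right ** 3 + m[i + 1] * left ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * right
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * left)

    return predict


spline = natural_cubic_spline(AGES, HEIGHTS)

if __name__ == "__main__":
    for age in (4, 11.5, 16.25, 17):
        print(age, round(spline(age), 2))
```

The trade-off against a single polynomial is storage: a spline keeps every data point plus one extra number per point, whereas a fitted polynomial keeps only its coefficients.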
 
  • #11
kenewbie said:
Linear algebra is still a bit above my competence-level, I am afraid. I understand that the fitting is tedious to do by hand (with a calculator) but I can make a program to do it for me as long as I understand how it works. Will I be able to pick up and understand the bits of linear algebra that I need for this exact problem with just precalc as a basis?



I think so. Even though my points are not linear, I build the coefficients for my polynomial by finding the values that would alter the points to MAKE them linear, so to speak. That was clever.



I am 100% certain that they will need individual equations. My acceptable margin of error is +/- 0.5, which means they need to be quite specific.

k

kenewbie,

You begin seeing Linear Algebra in a very simplified way towards the end of Introductory Algebra. Some courses of Intermediate Algebra take Linear Algebra, still fairly simplified, a little further. Certainly some Pre-Calculus or College Algebra courses will give a somewhat stronger treatment of Linear Algebra. Either of those courses may give you enough Linear Algebra treatment for you to understand elementary row operations; so very likely at the end of one year of studying Algebra, you would be able to understand enough to know how you could solve a system of linear equations. If your courses did not treat the topic of Linear Algebra well enough (or skipped the topic), you can find a good used book of Intermediate Algebra and study the topic on your own.

You could use a polynomial equation system in the form of a system of linear equations like this:
Imagine you have a set of cubic equations, third degree polynomials. Each would be in the form (excuse my lack of typesetting), y = a*x^3 + b*x^2 + c*x + d. Notice that the form of the equation has two variables, x and y. With two variables you might expect to need two such equations; in practice, that would not help.
Now realize that you will have actual data points WHICH YOU KNOW, let us say (x1, y1), (x2, y2), (x3, y3), and (x4, y4). These are not variables. They are ordered pairs of KNOWN values. Each one of them can be substituted into your basic cubic equation and the computations performed. Now, each x and y used in each equation gives you something which really takes the form of a linear equation. Do you now see that in the form, y = a*x^3 + b*x^2 + c*x + d, the unknowns are really the coefficients a, b, c, and d? Now, you can form your needed system of FOUR EQUATIONS. See? Four equations and four unknowns. You will solve for the unknown coefficients.

You could try the simple methods of solving a system that you would use in Intro Algebra, or you could similarly use matrices as you might in Intermediate Algebra... or you can simply put the points into a program such as CurveExpert and fit the data points to whatever polynomial gives you the results you want. Note that to solve for the coefficients exactly, you will need as many data points as there are unknowns in your equation.
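The four-equations, four-unknowns idea can be sketched directly. The four points below are picked from the example data in post #3; which four to use is an arbitrary choice made here, and the function names are hypothetical. Each point contributes one row [x^3, x^2, x, 1 | y], and elementary row operations reduce the system until the coefficients can be read off.

```python
def row_reduce(augmented):
    """Gauss-Jordan elimination; returns the solution of the square system."""
    n = len(augmented)
    rows = [row[:] for row in augmented]
    for col in range(n):
        # Row swap: bring the largest entry in this column to the pivot spot.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        scale = rows[col][col]
        rows[col] = [value / scale for value in rows[col]]
        # Subtract multiples of the pivot row to clear the rest of the column.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


# Four known (age, height) pairs taken from the data posted earlier.
POINTS = [(4, 104.3), (8, 130.0), (13, 157.5), (17, 178.0)]

# One linear equation per point: a*x^3 + b*x^2 + c*x + d = y.
AUGMENTED = [[float(x ** 3), float(x ** 2), float(x), 1.0, y]
             for x, y in POINTS]
a, b, c, d = row_reduce(AUGMENTED)


def cubic(x):
    """Evaluate the cubic through the four chosen points."""
    return ((a * x + b) * x + c) * x + d


if __name__ == "__main__":
    print(a, b, c, d)
    for x, y in POINTS:
        print(x, y, round(cubic(x), 6))
```

With more data points than unknowns there is generally no exact solution, and the system is solved in the least-squares sense instead.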
 

FAQ: Nonlinear Regression: Getting Started with X to Predict Y

What is nonlinear regression?

Nonlinear regression is a statistical method used to model the relationship between a dependent variable (Y) and one or more independent variables (X) that does not follow a linear pattern. It involves finding the best-fit curve that describes the relationship between the variables, rather than a straight line.

What types of data can be analyzed using nonlinear regression?

Nonlinear regression can be used to analyze any type of data, as long as the relationship between the variables can be described by a nonlinear function. This includes data from various fields such as biology, economics, engineering, and social sciences.

How is nonlinear regression different from linear regression?

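Linear regression assumes a model that is linear in its unknown parameters, while nonlinear regression allows models in which the parameters appear nonlinearly. A linear model need not be a straight line: a polynomial fit like the one discussed in this thread is still a linear least-squares problem, because the unknown coefficients appear linearly. Linear regression can be solved directly from a system of linear equations, whereas nonlinear regression usually requires iterative estimation of the parameters.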

What are some common nonlinear functions used in nonlinear regression?

Some common curved model functions include exponential, logarithmic, power, and polynomial functions. Polynomials produce curved fits but are linear in their coefficients, so fitting one is technically linear regression. The choice of function depends on the nature of the relationship between the variables and the type of data being analyzed.

How is the best-fit curve determined in nonlinear regression?

The best-fit curve in nonlinear regression is determined by minimizing the sum of squared errors between the observed data points and the predicted values from the chosen nonlinear function. This is typically done using a statistical software program or by using iterative optimization techniques such as Gauss-Newton, Levenberg-Marquardt, or gradient descent.
