Filling in Missing Values in a string of data

In summary, you collected data for an automatic transformer and estimated the missing values for June and July.
  • #1
Brian82784
19
0
TL;DR Summary
Finding a missing value for trending and analysis
hello,

I’m trying to figure out if I’m doing this correctly or if there’s a different way that I should be finding a missing value. I’m trending data for an automatic transformer. Every month I collect the operations counter value and at the end of the year sum the number of tap changes. Sometimes I can’t get to the transformer so there are months where I don’t Collect the count.

For example:

Operations count
Jan = 654104
Feb = Missing
Mar = 659176

To find approximate missing value take 654104 + 659176 = 1313280 and dividing 1313280/2 = 656640

Now it would look like
Jan = 654104
Feb = 656640
Mar = 659176

I will get into a bit more down below.

Operations Counter

Jan = 654104
Feb = missing (populate with calculated 656640)
Mar = 659176
April= 661476
May= 664376
June= missing
July = missing
August = 671976
September= 675076

I then find the total changes month to month

Total tap Changes
Jan - Feb = 2536
Feb- March = 2536
March-April= 2300
April-May= 2900

After I have a years worth of data I sum the total changes. To see how much it operated in a year. So above from Jan - May would be a total of 10,272 operations (I know it’s not. Full year just showing an example).
 
Mathematics news on Phys.org
  • #2
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
 
  • #3
berkeman said:
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
Just wondering if this is the best/right way or if I’m completely wrong.
 
  • #4
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.
 
  • Like
Likes Vanadium 50 and FactChecker
  • #5
I would be inclined to keep it simple (linear interpolation) unless there was a significant reason to get more complicated (higher-degree polynomial). The linear interpolation would also be safer if there is missing data at either end. That would require extrapolation of the known data, which can give more extreme estimates from a higher-degree approximation.
 
  • #6
Mark44 said:
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.

I’m not sure how to calculate that could you show me an example?
 
  • #7
What I described is called quadratic regression. Here's an article that shows how to do this in Excel - https://www.statology.org/quadratic-regression-excel/. You need to have Analysis Toolpack installed to do this (it's a free download). If you have some other spreadsheet, there likely is an equivalent set of tools available.
 
  • #8
You need to decide how to attribute the total change over February and March to those two months. The simplest way is to apportion it equally, which is what you did. Alternatively, you can apportion them according to the number of (work)days in each month.

A more accurate approach is to apportion operations to months according to the actual dates of the meter readings, if you have that information.

Regardless of the method, you need to ensure that people who use this data can find out (1) which figures were exact and which had to be estimated, and (2) how the estimation was done.
 

FAQ: Filling in Missing Values in a string of data

What is the purpose of filling in missing values in a string of data?

The purpose of filling in missing values in a string of data is to ensure that the data is complete and accurate. Missing values can lead to biased or incorrect analysis and can affect the overall quality of the data.

What are the different methods for filling in missing values in a string of data?

There are several methods for filling in missing values in a string of data, including mean/median imputation, regression imputation, and hot deck imputation. Each method has its own advantages and disadvantages, and the best approach will depend on the type of data and the specific situation.

How do I determine which method to use for filling in missing values in a string of data?

The best method for filling in missing values will depend on several factors, such as the type of data, the amount of missing values, and the purpose of the analysis. It is important to carefully consider these factors and choose the method that will result in the most accurate and unbiased data.

Can filling in missing values in a string of data affect the results of my analysis?

Yes, filling in missing values can have a significant impact on the results of your analysis. It is important to carefully consider the method used for filling in missing values and to evaluate the potential effects on the overall analysis.

Are there any potential drawbacks to filling in missing values in a string of data?

Yes, there are potential drawbacks to filling in missing values in a string of data. One potential drawback is that the filled-in values may not accurately represent the true values that were missing. Additionally, the process of filling in missing values may introduce bias into the data. It is important to carefully evaluate the potential drawbacks and consider alternative methods if necessary.

Back
Top