Is there a statistically significant increase in phrase occurrences?

In summary, a new forum member asks how to generate a daily signal from the number of occurrences of specified phrases in a news data feed. They first compared each day's count against a moving average, but found the signal of little value because the phrase counts are very bursty. They ask which statistics to use to identify a statistically significant event in this scenario. One suggestion given is to calculate the historical average and standard deviation and test whether the current count exceeds the average plus two standard deviations.
  • #1
kmrstats
Hi -

First timer here. Excuse me if this question is not up to the level I see posted on this forum, but here goes.

I have been asked to provide a daily signal generated from the number of occurrences of a set of specified phrases present in a news data feed. The first thing I did was generate a moving average from the daily count of each phrase in the feed and generate a signal if the current count was above the moving average by a specified percentage. Using this approach, I didn't think the signal provided much value because the phrase counts are very bursty. The count can be in the low teens for a number of days in a row, then jump to 100 for a couple of days, and then settle back into the low teens.

What type of statistics should I use to determine a statistically significant event given my scenario described above?

Thanks in advance
 
  • #2
One way is to:
1. Calculate the historical average up to day t: HA(t) = [itex]\frac{1}{t}\sum_{s=1}^{t} n_s[/itex], where [itex]n_s[/itex] is the number of occurrences on day s.
2. Calculate the historical standard deviation HSD(t) similarly, i.e. the standard deviation of [itex]n_1, \dots, n_t[/itex].
3. Test whether [itex]n_t > HA(t) + 2\,HSD(t)[/itex]; if so, flag day t.
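
The three steps above can be sketched in Python as follows. This is only an illustration under a few assumptions that the post does not settle: the function name `flag_significant_days` and the sample counts are made up, the standard deviation is the population form (`statistics.pstdev`), and day t is included in its own history, as the formula for HA(t) is written.

```python
from statistics import mean, pstdev

def flag_significant_days(counts, k=2.0):
    """Return one boolean per day: True when n_t > HA(t) + k * HSD(t).

    counts: daily occurrence counts n_1, ..., n_T (hypothetical input).
    k:      number of standard deviations above the average (2 in the post).
    """
    flags = []
    for t in range(1, len(counts) + 1):
        history = counts[:t]      # n_1 .. n_t, current day included
        ha = mean(history)        # historical average HA(t)
        hsd = pstdev(history)     # historical standard deviation HSD(t); population form is an assumption
        flags.append(counts[t - 1] > ha + k * hsd)
    return flags

# Made-up counts in the spirit of the question: low teens, then a two-day burst.
daily_counts = [12, 13, 11, 14, 12, 100, 95, 13]
flags = flag_significant_days(daily_counts)
print(flags)
```

With these made-up counts only the first day of the burst is flagged: once the value 100 enters the history, it inflates both the average and the standard deviation, so the following day's 95 no longer exceeds the threshold. Computing HA and HSD from days 1 to t-1 only is a possible variation, but it is not what the post states.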
 

FAQ: Is there a statistically significant increase in phrase occurrences?

What is bursty data?

Bursty data refers to a type of data that has unusual spikes or bursts of activity, rather than being evenly distributed over time. This can be seen in various types of data, such as internet usage, stock market fluctuations, or social media trends.

How is bursty data different from traditional data?

Traditional data is typically characterized by a consistent pattern or distribution over time, while bursty data has irregular spikes and bursts. This can make it more challenging to analyze and interpret.

What statistical methods are used for analyzing bursty data?

Some common statistical methods for analyzing bursty data include time series analysis, burst detection algorithms, and power law modeling. These methods account for the irregular spikes in the data instead of assuming a steady rate of activity.
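
As one simple example of a burst detection rule, the sketch below uses a modified z-score based on the median and the median absolute deviation (MAD). This technique is not mentioned in the thread; it is shown here only because the median and MAD are much less affected by the bursts themselves than the mean and standard deviation are. The function name `robust_burst_flags` and the sample counts are hypothetical, and the cutoff of 3.5 is a commonly used convention, not a value from the source.

```python
from statistics import median

def robust_burst_flags(counts, threshold=3.5):
    """Flag days whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median of the whole series.
    """
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    if mad == 0:
        # More than half of the days are identical; fall back to flagging
        # anything above the median (an arbitrary choice for this sketch).
        return [c > med for c in counts]
    return [0.6745 * (c - med) / mad > threshold for c in counts]

# Made-up counts: low teens with a two-day burst.
daily_counts = [12, 13, 11, 14, 12, 100, 95, 13]
burst_flags = robust_burst_flags(daily_counts)
print(burst_flags)
```

Note that this version looks at the whole series at once, so it is retrospective. A daily signal would need to apply the same rule to the history available up to each day.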

What are some real-world applications of bursty data analysis?

Bursty data analysis can be applied in a variety of fields, such as social media marketing, network traffic management, and financial forecasting. It can also be useful in understanding customer behavior, predicting demand, and detecting anomalies in data.

How can bursty data be managed and minimized?

There are various techniques for managing and minimizing the effects of bursty data, such as data smoothing, filtering, and compression. Additionally, data can be resampled or aggregated to reduce the impact of spikes and bursts on the overall analysis.
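
Two of the techniques named above, smoothing and aggregation, can be illustrated with a short Python sketch. The helper names `rolling_median` and `aggregate`, the window and block sizes, and the sample counts are all assumptions chosen for the example.

```python
from statistics import median

def rolling_median(counts, window=3):
    """Trailing rolling median; early days use whatever shorter history exists."""
    return [
        median(counts[max(0, i - window + 1): i + 1])
        for i in range(len(counts))
    ]

def aggregate(counts, block=7):
    """Sum consecutive blocks of days (e.g. block=7 gives weekly totals)."""
    return [sum(counts[i:i + block]) for i in range(0, len(counts), block)]

# Made-up counts with a single one-day spike.
daily_counts = [12, 13, 11, 14, 12, 100, 13, 12]
smoothed = rolling_median(daily_counts, window=3)
blocks = aggregate(daily_counts, block=4)
print(smoothed)
print(blocks)
```

A three-day median removes a one-day spike entirely, but a burst lasting two or more days would pass through unless the window is widened. Smoothing also hides exactly the events that a burst-based signal is meant to detect, so it suits trend estimation better than event detection.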
