Feature Selection for Revolution: Stats or Subject Matter?

In summary, the conversation discusses how to identify features associated with the onset of revolution, in particular a large population of young people and a pyramid-shaped demographic distribution. The question is whether this determination rests on statistics, subject-matter knowledge, or a combination of both. The process of feature selection is also discussed, with the advice to test on a subset of data that did not inform the selection and to start from a hypothesis with some reason to believe the feature has a causal relationship. It is acknowledged that many studies may have presented false discoveries due to small sample sizes and the influence of other variables, and examples are given of countries that do not fit the proposed demographic pattern of revolution.
  • #1
WWGD
Science Advisor
Gold Member
TL;DR Summary
How to find features that provide a high correlation with a dependent variable.
Hi, I remember reading a paper a while back that argued/proved that a large population of young people (say under 19 or so), i.e. a population pyramid that is thick at the bottom, is a necessary condition for the onset of revolution.
**My question:** Is this determination based on statistics alone, on subject-matter knowledge, or on a combination of both? And what process would one follow to do feature selection, i.e. to choose which features to associate with a given dependent variable, other than just basic correlation analysis? Maybe some type of ANOVA?
 
  • #2
WWGD said:
Summary:: How to find features that provide a high correlation with a dependent variable.

Hi, I remember reading a paper a while back that argued/proved that a large population of young people (say under 19 or so), i.e. a population pyramid that is thick at the bottom, is a necessary condition for the onset of revolution.
**My question:** Is this determination based on statistics alone, on subject-matter knowledge, or on a combination of both? And what process would one follow to do feature selection, i.e. to choose which features to associate with a given dependent variable, other than just basic correlation analysis? Maybe some type of ANOVA?
There are a number of ways.
https://scikit-learn.org/stable/modules/feature_selection.html

Always make sure that you test on a subset of data that did not inform the selection. For example, if you are using the features for a predictive model, do the feature selection within the cross-validation loop, not beforehand on the full data. If your feature selection process includes a search for optimal parameters, do that search within an inner (nested) cross-validation loop. In classical machine learning, the feature selection step is often itself part of the model: the whole pipeline is, including preprocessing, feature selection, and parameter tuning.
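As an illustration, here is a minimal sketch (using scikit-learn on a synthetic dataset, so the column counts and scores are purely illustrative) of making feature selection a pipeline step, so it is re-fit on each training fold and never sees the held-out fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 200 samples, 50 candidate features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Feature selection happens inside the pipeline, so each CV split re-runs it
# on the training fold only; the test fold never informs the selection.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

If the number of features to keep (k here) is itself tuned, that tuning would go in a GridSearchCV wrapped around the pipeline, giving the nested cross-validation mentioned above.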

In some cases there is a large number of candidate features, and simply searching for the best ones can fail, since there is some chance that noise alone will produce an apparent correlation. In those cases, and to some extent in general, it is important to also have some reason to believe the feature might have a causal relationship with the outcome, or that the population distributions should show a correlation. That way you begin with a hypothesis and a much smaller number of candidates, and you have a better chance that your finding is reliable.

It is believed that a very large fraction of published statistical research is faulty because of this issue. Different scientific fields and sub-fields are always trying to work towards more robust methodology to avoid these kinds of pitfalls. For example, the p-value threshold that can be relied on depends strongly on the application. Many, many works have presented false discoveries, or bad results in general, because of this, for example by assuming p < 0.05 is enough (not to mention p-hacking).
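As a rough, simulated illustration of the multiple-comparisons problem (not from any particular study): if you screen many pure-noise features against an unrelated outcome, several will pass p < 0.05 by chance, and a correction such as Bonferroni removes most of them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples, n_features = 50, 200

# The outcome and every candidate feature are independent noise, so any
# "significant" feature found here is a false positive by construction.
y = rng.normal(size=n_samples)
X = rng.normal(size=(n_samples, n_features))

# Pearson correlation test of each feature against the outcome.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

print("features passing p < 0.05:", (pvals < 0.05).sum())   # typically around 10
print("passing Bonferroni (p < 0.05/200):", (pvals < 0.05 / n_features).sum())
```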
 
  • #3
BWV
I don't think statistics will prove anything here, as the samples are too small and there are too many other variables. Did the USSR have a revolution in 1991? What about Germany in 1918? If so, there are counterexamples. Until the last few decades, nearly every country had a pyramid-shaped demographic profile, except for periods where war had killed off a large number of younger people (like the post-war USSR). The countries of the past 20-30 years without a large number of young people have tended to be rich liberal democracies - so is the lack of revolutions in Western Europe due to being rich, or to being old?
 
  • #4
BWV said:
I don't think statistics will prove anything here, as the samples are too small and there are too many other variables. Did the USSR have a revolution in 1991? What about Germany in 1918? If so, there are counterexamples. Until the last few decades, nearly every country had a pyramid-shaped demographic profile, except for periods where war had killed off a large number of younger people (like the post-war USSR). The countries of the past 20-30 years without a large number of young people have tended to be rich liberal democracies - so is the lack of revolutions in Western Europe due to being rich, or to being old?
The idea is that older people usually have other concerns, like work and taking care of their families, tend to have more invested in the status quo than younger people, and are thus less willing to threaten their station in life by trying to overthrow the system. It may be more accurate to say that a pyramidal distribution increases the odds rather than guaranteeing revolution; a level of general discontent must prevail too, and maybe a Gini coefficient beyond a certain point contributes as well.
 

FAQ: Feature Selection for Revolution: Stats or Subject Matter?

What is feature selection and why is it important?

Feature selection is the process of selecting the most relevant and important features from a dataset to be used in a statistical or machine learning model. It is important because it helps improve the accuracy and efficiency of the model by reducing the number of irrelevant or redundant features.

How do you determine which features to select?

There are various methods for feature selection, including statistical tests, correlation analysis, and machine learning algorithms. The best approach may depend on the specific dataset and the goals of the analysis. It is important to carefully consider the characteristics of the data and the potential impact of each feature on the model's performance.
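As a hedged sketch of two such approaches side by side (scikit-learn on synthetic data, so the feature indices are illustrative only), one can compare a univariate statistical test with a model-based ranking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

f_scores, _ = f_classif(X, y)                      # ANOVA F-test per feature
forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = forest.feature_importances_          # model-based ranking

print("top 5 features by F-test:", np.argsort(f_scores)[::-1][:5])
print("top 5 features by forest:", np.argsort(importances)[::-1][:5])
```

Agreement between the two rankings is a useful sanity check; disagreement is a prompt to examine the features with subject-matter knowledge.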

Should feature selection be based on statistical significance or subject matter expertise?

Ideally, a combination of both statistical significance and subject matter expertise should be used in feature selection. Statistical tests can provide objective measures of feature importance, while subject matter experts can offer valuable insights into the relevance and potential impact of certain features on the model.

Can feature selection be automated?

Yes, feature selection can be automated using various algorithms and techniques. However, it is important to carefully evaluate the results and consider the limitations and potential biases of the automated approach. It is also recommended to involve subject matter experts in the process to ensure the selected features are relevant and meaningful.
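One example of an automated approach is recursive feature elimination with cross-validation (RFECV in scikit-learn), sketched here on synthetic data; in practice the estimator and scoring metric would be chosen to fit the problem:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

# RFECV repeatedly drops the weakest features and uses cross-validation
# to decide how many features to keep.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print("number of features kept:", selector.n_features_)
print("selected feature mask:", selector.support_)
```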

What are the potential challenges in feature selection?

Some potential challenges in feature selection include dealing with large and complex datasets, choosing the most appropriate selection method, and ensuring the selected features are not biased or misleading. It is also important to consider the impact of feature selection on the overall analysis and to carefully validate the results.
