Boosted Decision Trees algorithm

In summary, this thread discusses the Boosted Decision Tree (BDT) method in particle physics, specifically for identifying neutral pions in ATLAS. The method is trained on signal and background samples by applying one-dimensional cuts on discriminating variables. Cuts are applied recursively until the number of events in a subsample falls below a minimum, and the resulting objects are classified as signal or background. The algorithm then re-weights wrongly classified candidates and grows further trees, stopping when a pre-defined number of trees is reached. The minimum subsample size is necessary to avoid drawing conclusions based on random fluctuations in the data.
  • #1
ChrisVer
Gold Member
I am not sure whether this should go here, in statistical mathematics, or in the computing forum... feel free to move it. I am posting it here because I am trying to understand the algorithm as it is used in particle physics (e.g. identification of neutral pions in ATLAS).

As I read it:

In general we have some (correlated) discriminating variables and we want to combine them into one more powerful discriminant. For that we use the Boosted Decision Tree (BDT) method.
The method is trained with an input of signal (the cluster closest to each [itex]\pi^0[/itex], with a selection on the cluster-pion distance [itex]\Delta R < 0.1[/itex]) and background samples (the rest).
The training starts by applying a one-dimensional cut on the variable that provides the best discrimination between the signal and background samples.
This is then repeated separately in the failing and passing sub-samples, using the next most discriminating variable, until the number of events in a subsample falls below a minimum.
Objects are then classified as signal or background depending on whether they end up in a signal-like or background-like subsample. The result defines a tree.
The process is repeated with wrongly classified candidates weighted higher (boosted), and it stops when the pre-defined number of trees has been reached. The output is a likelihood estimator of whether the object is signal or background.
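
To check my understanding, here is a minimal toy sketch of that loop as I read it (my own code, not the ATLAS implementation; the min_leaf value, the AdaBoost-style weighting, and all the function names are my assumptions):

[code]
import numpy as np

def best_cut(X, y, w):
    # Scan every variable and every threshold; return the (variable, cut)
    # with the smallest weighted misclassification, i.e. the best
    # one-dimensional separation of signal (y=+1) from background (y=-1).
    best_j, best_t, best_err = 0, X[0, 0], np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            err = w[left & (y == 1)].sum() + w[~left & (y == -1)].sum()
            err = min(err, w.sum() - err)   # either side may be the signal-like one
            if err < best_err:
                best_j, best_t, best_err = j, t, err
    return best_j, best_t

def majority(y, w):
    # Leaf label: weighted majority vote (signal-like or background-like).
    return 1.0 if w[y == 1].sum() >= w[y == -1].sum() else -1.0

def grow_tree(X, y, w, min_leaf=100):
    # Stop splitting once a subsample holds fewer than min_leaf events
    # (the "minimum number of objects" criterion) or is already pure.
    if len(y) < min_leaf or len(np.unique(y)) == 1:
        return majority(y, w)
    j, t = best_cut(X, y, w)
    left = X[:, j] < t
    if left.all() or not left.any():        # the cut failed to split anything
        return majority(y, w)
    return (j, t,
            grow_tree(X[left], y[left], w[left], min_leaf),
            grow_tree(X[~left], y[~left], w[~left], min_leaf))

def predict_tree(node, x):
    while isinstance(node, tuple):          # walk down to a leaf
        j, t, lo, hi = node
        node = lo if x[j] < t else hi
    return node

def train_bdt(X, y, n_trees=50, min_leaf=100):
    # AdaBoost-style loop: grow a tree, then boost (up-weight) the
    # wrongly classified candidates before growing the next one.
    w = np.full(len(y), 1.0 / len(y))
    trees, alphas = [], []
    for _ in range(n_trees):                # stop at the pre-defined tree count
        tree = grow_tree(X, y, w, min_leaf)
        pred = np.array([predict_tree(tree, x) for x in X])
        err = w[pred != y].sum()
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)   # misclassified events get heavier
        w /= w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def bdt_score(trees, alphas, x):
    # Weighted vote over all trees: large positive = signal-like.
    return sum(a * predict_tree(t, x) for t, a in zip(trees, alphas))
[/code]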

My questions:
Starting with the best discriminating variable, the algorithm checks whether some cut on it is satisfied, leading to a YES or NO branch.
Then, in each of the two resulting boxes, another check is made with some other variable, and so on... However, I don't understand how this can stop somewhere rather than keep going until all the variables are checked (so I don't understand the part: "until the number of events in a subsample falls below a minimum").
A picture of what I am trying to explain is shown below (the weird names are just the names of the variables, S:signal, B:background):
[attached image: decision tree diagram]
 
  • #2
If you have only 100 events (an arbitrary number) left in a category, the statistical fluctuations are too large to draw reasonable conclusions. You would end up defining your cuts based on random fluctuations, and the resulting S/(S+B) values would look much better than they really are.
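
To illustrate the point numerically, here is a small toy demonstration (mine, not from the ATLAS analysis) using scikit-learn: on pure noise there is no real separation, yet a tree allowed to split down to single-event leaves "finds" one anyway, which is exactly what the minimum-leaf-size cutoff prevents.

[code]
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Pure noise: the features carry no information about the labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 5)), rng.integers(0, 2, 2000)
X_test,  y_test  = rng.normal(size=(2000, 5)), rng.integers(0, 2, 2000)

for min_leaf in (1, 100):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf).fit(X_train, y_train)
    print(f"min_samples_leaf={min_leaf:4d}: "
          f"train accuracy {tree.score(X_train, y_train):.2f}, "
          f"test accuracy {tree.score(X_test, y_test):.2f}")

# min_samples_leaf=1 gives ~1.00 on the training sample but ~0.50 on the
# independent test sample: the cuts were defined by random fluctuations.
[/code]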
 

FAQ: Boosted Decision Trees algorithm

What is a Boosted Decision Trees algorithm?

A Boosted Decision Trees algorithm is a machine learning technique used for classification and regression tasks. It combines multiple decision trees to create a more accurate and robust model.
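
For example, here is a hypothetical minimal usage with scikit-learn's implementation (the dataset and parameter values are arbitrary choices for illustration):

[code]
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic two-class dataset standing in for signal vs. background.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X, y)
print(clf.predict_proba(X[:5]))   # per-event class probabilities
[/code]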

How does a Boosted Decision Trees algorithm work?

A Boosted Decision Trees algorithm works by sequentially adding decision trees to the model, with each subsequent tree correcting the errors of the previous ones. It uses a technique called gradient boosting to optimize the model and minimize errors.
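
As a rough illustration of the gradient-boosting idea, here is a simplified sketch with squared-error loss (not how any particular library implements it; real libraries add regularization, subsampling, and other refinements):

[code]
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Start from a constant model, then let each tree fit the residuals
    # (the negative gradient of the squared-error loss) of the current model.
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # each tree corrects the previous ones
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
[/code]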

What are the advantages of using a Boosted Decision Trees algorithm?

Some advantages of using a Boosted Decision Trees algorithm include its ability to handle both numerical and categorical data, its relatively good resistance to overfitting when the number of trees and their depth are properly controlled, and its high accuracy on large and complex datasets.

Are there any limitations to using a Boosted Decision Trees algorithm?

One limitation of using a Boosted Decision Trees algorithm is that it can be computationally expensive, especially when dealing with large datasets. It also requires careful tuning of hyperparameters (e.g. the learning rate, tree depth, and number of trees) to avoid overfitting.

In which applications is a Boosted Decision Trees algorithm commonly used?

A Boosted Decision Trees algorithm is commonly used in applications such as fraud detection, recommendation systems, and customer churn prediction. It is also used in industries like finance, healthcare, and marketing for data analysis and predictions.
