Fit_transform() vs. transform()

  • Python
  • Thread starter EngWiPy
  • Tags
    Transform
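In summary, the method fit_transform() in scikit-learn is essentially the same as calling fit() and then transform() on the same data. For PolynomialFeatures, the fit step only determines which combinations of features to generate for the given input shape, and transform() then computes them. Calling fit_transform() on the training data and transform() on the test data applies the same fitted transformation to both, which is what you need when training a model on one data set and testing its performance on another.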
  • #1
EngWiPy
Hi,

I noticed that in some cases we first call fit_transform(), and afterwards we call transform(). Like in the following example:

Code:
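import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Single-feature training and test sets, reshaped into column vectors
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
X_test = np.array([6, 8, 11, 16]).reshape(-1, 1)

quadratic_featurizer = PolynomialFeatures(degree=2)

# Fit on the training data and transform it in one call
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
# Reuse the already fitted featurizer on the test data
X_test_quadratic = quadratic_featurizer.transform(X_test)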

Why? What is the difference between the two methods?

Thanks
 
  • #2
I haven't used PolynomialFeatures before, but fit(), fit_transform(), and transform() are standard methods in scikit-learn.

fit_transform() is essentially the same as calling fit() and then transform(), so it is a shortcut for two commands in one, if you wish.

So when you do X_train_quadratic = quadratic_featurizer.fit_transform(X_train), you are fitting quadratic_featurizer on X_train and using it to transform X_train itself. This should be equal to (and is shorthand for):

quadratic_featurizer.fit(X_train)
X_train_quadratic = quadratic_featurizer.transform(X_train)

On the other hand, when you do X_test_quadratic = quadratic_featurizer.transform(X_test), you are using a previously fitted quadratic_featurizer to transform X_test. This should fail (scikit-learn typically raises a NotFittedError) unless you have previously called either .fit() or .fit_transform() on quadratic_featurizer.

Hope it makes sense.

I am guessing what you are trying to do is actually:
quadratic_featurizer.fit(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)

Although what you did, i.e.:
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)

will also work, but you are unnecessarily transforming X_train by calling fit_transform() instead of fit(), unless you need the transformed training data later.
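
The equivalence described above can be checked directly. The following is a minimal sketch using the arrays from the first post; the names one_step, two_step, featurizer, and same_result are made up for illustration and are not from the thread.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
X_test = np.array([6, 8, 11, 16]).reshape(-1, 1)

# One-step version: fit and transform the training data in a single call.
one_step = PolynomialFeatures(degree=2).fit_transform(X_train)

# Two-step version: fit first, then transform the same data.
featurizer = PolynomialFeatures(degree=2)
featurizer.fit(X_train)
two_step = featurizer.transform(X_train)

# Both routes should produce the same array.
same_result = np.array_equal(one_step, two_step)
print(same_result)

# The fitted featurizer can then be reused on the test data.
X_test_quadratic = featurizer.transform(X_test)
print(X_test_quadratic.shape)
```

If the transformed training data is needed later (for example, to train a regression model on it), the one-step call saves a line; otherwise fit() alone is enough before transforming the test set.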
 
  • #3
It makes sense. I did X_train_quadratic = quadratic_featurizer.fit_transform(X_train) because later in my code I use X_train_quadratic to train a model using .fit() and then test the performance of the model on X_test_quadratic.

I have one question: what does the method .fit_transform() fit to the training data X_train? For example, if
X_train = [[1],
           [2],
           [3],
           [4]]
then quadratic_featurizer.fit_transform(X_train) will result in
[[1, 1, 1],
 [1, 2, 4],
 [1, 3, 9],
 [1, 4, 16]]
where each row holds the values 1, x_1, and x_1^2 that multiply the coefficients in the polynomial equation
[tex]y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2[/tex]

In this case, what does .fit_transform() fit to X_train?

Thanks
 
  • #5
It looks like it is not actually fitting anything. I think it is called fit_transform() simply because scikit-learn tries to provide a uniform interface, and many other modules in scikit-learn use the same terminology. What fit() and the fit part of fit_transform() seem to do is simply determine the combinations of features to return for the given input shape. So when you later call transform() many times, it can skip that part and simply return the values.

So, in this case the fit() part figures out that there is a single feature and that x^0, x^1, and x^2 need to be returned, and the transform() part simply returns them for each sample on that basis.
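
One way to see this is to inspect what fit() stores. The sketch below assumes a reasonably recent scikit-learn in which PolynomialFeatures exposes the fitted attributes n_output_features_ and powers_; the variable names (n_out, powers, other, same_transform) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([1, 2, 3, 4]).reshape(-1, 1)

featurizer = PolynomialFeatures(degree=2)
featurizer.fit(X_train)

# What fit() recorded: the number of output columns and the exponent
# applied to each input feature in each output column.
n_out = featurizer.n_output_features_
powers = featurizer.powers_
print(n_out)
print(powers)

# transform() then evaluates x^0, x^1, x^2 for every sample.
X_quadratic = featurizer.transform(X_train)
print(X_quadratic)

# Fitting on different values with the same number of columns
# leads to the same transformation of X_train.
other = PolynomialFeatures(degree=2).fit(np.array([[100], [200]]))
same_transform = np.array_equal(other.transform(X_train), X_quadratic)
print(same_transform)
```

The last check suggests that, for this particular transformer, only the shape of the data passed to fit() matters, not its values.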
 
  • #6
Thanks for your replies. It is more clear now.
 

FAQ: Fit_transform() vs. transform()

What is the difference between fit_transform() and transform() in machine learning?

Answer: fit_transform() combines two steps: it fits a transformer to the data (learning whatever the transformation needs, such as which feature combinations to generate) and then transforms that same data with the fitted transformer. transform(), on the other hand, only applies an already fitted transformation to the data without fitting anything.

When should I use fit_transform() and when should I use transform()?

Answer: fit_transform() is typically used once, on the training data, so that the transformer is fitted and the training data is transformed in a single call. transform() is used afterwards, when the transformer has already been fitted and the same transformation needs to be applied to new data such as a test set.

Can I use transform() without using fit_transform() first?

Answer: Yes, as long as the transformer has already been fitted, for example by calling fit() on the training data. transform() does not fit anything itself, so calling it on a transformer that has never been fitted should fail.
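
As a small illustration of this answer, the sketch below calls transform() on an unfitted PolynomialFeatures and then again after a plain fit(). It assumes the unfitted call raises sklearn.exceptions.NotFittedError, which is the usual behavior for scikit-learn transformers; the flag failed_unfitted is a made-up name.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
X_test = np.array([6, 8, 11, 16]).reshape(-1, 1)

featurizer = PolynomialFeatures(degree=2)

# transform() before any fit: expected to raise NotFittedError.
try:
    featurizer.transform(X_test)
    failed_unfitted = False
except NotFittedError:
    failed_unfitted = True
print(failed_unfitted)

# fit() alone is enough; fit_transform() is not required.
featurizer.fit(X_train)
X_test_quadratic = featurizer.transform(X_test)
print(X_test_quadratic.shape)
```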

Is fit_transform() necessary for all machine learning models?

Answer: No. fit_transform() is a method of preprocessing transformers, not of the models themselves, and not every model needs its inputs transformed. Some models, such as decision trees, can often be trained directly on the raw features. Others, like neural networks, usually benefit from transformations such as feature scaling.

What happens if I use fit_transform() on new data?

Answer: Calling fit_transform() on new data generally does not raise an error, but it refits the transformer to the new data, so the new data may be transformed differently from the training data. Use fit_transform() (or fit()) only on the training data and transform() on new data, so that the transformation fitted on the training data is the one applied everywhere.
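
In practice, scikit-learn transformers accept a second fit_transform() call and simply refit. The sketch below uses StandardScaler, which is not part of the thread and is chosen here only because, unlike PolynomialFeatures, it learns statistics from the data, so the effect of refitting is visible. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([6, 8, 10, 14, 18], dtype=float).reshape(-1, 1)
X_test = np.array([6, 8, 11, 16], dtype=float).reshape(-1, 1)

# Fit on the training data: the scaler learns the training mean (11.2).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
train_mean = scaler.mean_[0]

# Usual approach: reuse the training statistics on the test data.
X_test_scaled = scaler.transform(X_test)

# Calling fit_transform() on the test data runs without error, but it
# refits on the test set, whose mean is 10.25 rather than 11.2.
refit_scaler = StandardScaler()
X_test_refit = refit_scaler.fit_transform(X_test)
refit_mean = refit_scaler.mean_[0]

print(train_mean, refit_mean)
print(np.allclose(X_test_scaled, X_test_refit))
```

The two versions of the scaled test set differ, which is why a model trained on X_train_scaled should be evaluated on X_test_scaled rather than on the refitted version.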
