# scikit-learn__scikit-learn-25570

# Loki Mode Multi-Agent Patch

# Attempts: 1

Based on the issue and the architect's analysis, I need to generate a patch for the `_column_transformer.py` file in scikit-learn. The issue is that when `ColumnTransformer` is used with pandas output, transformers with 0 features cause a mismatch between the number of output columns and the number of feature names.

According to the architect's analysis, the fix belongs in the `_hstack` method, around lines 856-862. `_iter(fitted=True)` skips unfitted transformers (those with 0 features), but `get_feature_names_out()` does not account for this properly.

Let me generate the patch based on the standard scikit-learn source code structure:

```
--- a/sklearn/compose/_column_transformer.py
+++ b/sklearn/compose/_column_transformer.py
@@ -853,11 +853,17 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
                 output = pd.concat(Xs, axis=1)
 
                 # If all transformers define `get_feature_names_out`, then transform
-                # will return a DataFrame. We use the column names from the original
-                # fit to assign names to the DataFrame columns.
-                names_out = self._add_prefix_for_feature_names_out(
-                    list(self._iter(fitted=True))
-                )
+                # will return a DataFrame. We use the column names from the transformers
+                # that actually produced output (have features).
+                transformer_with_feature_names_out = [
+                    (name, trans, columns)
+                    for name, trans, columns in self._iter(fitted=True)
+                    if hasattr(trans, "get_feature_names_out")
+                    or trans == "passthrough"
+                    and len(_get_column_indices(self._feature_names_in, columns)) > 0
+                ]
+                names_out = self._add_prefix_for_feature_names_out(transformer_with_feature_names_out)
                 output.columns = names_out
                 return output
```
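
One caveat about the filter condition in this draft: Python's `and` binds tighter than `or`, so the condition parses as `hasattr(...) or (trans == "passthrough" and len(...) > 0)`. The length check would therefore apply only to passthrough entries. A minimal sketch with placeholder booleans (the names `has_names_out`, `is_passthrough`, and `has_columns` are illustrative stand-ins, not scikit-learn identifiers):

```python
# Placeholder values standing in for the three sub-conditions of the draft filter.
has_names_out = True    # stands in for hasattr(trans, "get_feature_names_out")
is_passthrough = False  # stands in for trans == "passthrough"
has_columns = False     # stands in for len(column indices) > 0

# How Python parses the draft condition: `and` binds tighter than `or`.
as_written = has_names_out or is_passthrough and has_columns

# What a zero-column filter would need: explicit parentheses.
with_parens = (has_names_out or is_passthrough) and has_columns

print(as_written)   # True: the zero-column transformer is kept
print(with_parens)  # False: the zero-column transformer is dropped
```

Without the parentheses, a transformer that defines `get_feature_names_out` passes the filter even when it selects no columns.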

Wait, let me reconsider the issue more carefully. The problem is that:

1. When a transformer has 0 features, it doesn't get fitted (it's skipped).
2. `_iter(fitted=True)` correctly skips these unfitted transformers.
3. However, `get_feature_names_out()` might still be called for these transformers, or there is a mismatch somewhere else.

Looking at the error more closely: "Length mismatch: Expected axis has 3 elements, new values have 0 elements". This means `names_out` has 0 elements while the output DataFrame has 3 columns.
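
The count mismatch can be illustrated without pandas or scikit-learn. This is a minimal sketch: `assign_columns` is a hypothetical stand-in for the length check that happens when `output.columns = names_out` is assigned, and its message only mimics the error quoted above.

```python
def assign_columns(n_output_columns, names_out):
    """Hypothetical stand-in for assigning `output.columns = names_out`.

    Raises ValueError when the number of names differs from the number of
    columns, mimicking the length check quoted in the error above.
    """
    if len(names_out) != n_output_columns:
        raise ValueError(
            f"Length mismatch: Expected axis has {n_output_columns} elements, "
            f"new values have {len(names_out)} elements"
        )
    return list(names_out)


# Three concatenated columns, but no generated names.
try:
    assign_columns(3, [])
except ValueError as exc:
    message = str(exc)
    print(message)
```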

The actual issue seems to be that `_add_prefix_for_feature_names_out` receives its transformers from `_iter(fitted=True)`, but the feature names are somehow not generated correctly. Let me look at the standard pattern in scikit-learn.

After analyzing the issue, the fix should filter out transformers that have zero output features when computing `names_out`:

--- a/sklearn/compose/_column_transformer.py
+++ b/sklearn/compose/_column_transformer.py
@@ -853,9 +853,15 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
                 output = pd.concat(Xs, axis=1)
 
                 # If all transformers define `get_feature_names_out`, then transform
-                # will return a DataFrame. We use the column names from the original
-                # fit to assign names to the DataFrame columns.
-                names_out = self._add_prefix_for_feature_names_out(
-                    list(self._iter(fitted=True))
-                )
+                # will return a DataFrame. We use the column names from transformers
+                # that produced output. Filter out transformers with no features.
+                transformer_list = []
+                for name, trans, columns in self._iter(fitted=True):
+                    n_features = len(
+                        _get_column_indices(self._feature_names_in, columns)
+                    )
+                    if n_features > 0:
+                        transformer_list.append((name, trans, columns))
+                names_out = self._add_prefix_for_feature_names_out(transformer_list)
                 output.columns = names_out
                 return output
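
The filtering idea in this patch can be sketched in isolation. This is a simplified model under stated assumptions, not scikit-learn's implementation: `transformers` is a hypothetical list of `(name, columns)` pairs, and `prefixed_names` is an illustrative helper that mimics prefixing each column with its transformer name.

```python
def prefixed_names(transformers):
    """Build `name__column` labels, skipping transformers with no columns.

    `transformers` is an illustrative list of (name, columns) pairs; it is
    not the structure scikit-learn uses internally.
    """
    names = []
    for name, columns in transformers:
        if len(columns) == 0:
            # Zero selected columns means zero output columns, so skip it.
            continue
        names.extend(f"{name}__{col}" for col in columns)
    return names


transformers = [
    ("categorical", []),             # selects no columns
    ("numerical", ["a", "b", "c"]),  # selects three columns
]
names_out = prefixed_names(transformers)
print(names_out)
```

With the empty entry skipped, the number of generated names matches the three output columns.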