I have a dataset input, which is a list of ~40000 letters (that are represented as strings).
With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation1:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sklearn.pipeline
vectorizer = TfidfVectorizer(lowercase=False)
representation1 = vectorizer.fit_transform(input) # TFIDF representation
Now I want to manually add one feature, representation2, for every letter. This feature should be the ratio of distinct words to total words in each letter/string:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(input) # sparse count matrix; fit once and reuse
sum_words = np.asarray(counts.sum(axis=1)).ravel() # total words per letter
sum_different_words = np.asarray((counts > 0).sum(axis=1)).ravel() # distinct words per letter
representation2 = np.divide(sum_different_words, sum_words) # fraction of distinct words
The array representation2 is now an array of shape (39077,) (as expected). I now want to combine representation1 and representation2 into one feature vector representation.
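For reference, a minimal sketch of the combined result being aimed for, built by hand with scipy.sparse.hstack instead of FeatureUnion. The three-string list `letters` is a hypothetical stand-in for the real dataset, and the choice of hstack is an assumption about what "combine" should mean here (one extra column appended to the TF-IDF matrix):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in for the real list of ~40000 letters.
letters = ["the cat sat on the mat", "a dog and a dog", "hello world"]

# TF-IDF features: sparse matrix of shape (n_letters, n_terms).
representation1 = TfidfVectorizer(lowercase=False).fit_transform(letters)

# Ratio of distinct words to total words, computed on the sparse counts.
counts = CountVectorizer().fit_transform(letters)
sum_words = np.asarray(counts.sum(axis=1)).ravel()
sum_different_words = np.asarray((counts > 0).sum(axis=1)).ravel()
representation2 = sum_different_words / sum_words  # shape (n_letters,)

# hstack needs 2-D blocks, so the 1-D array is reshaped to a single column.
extra_column = sparse.csr_matrix(representation2.reshape(-1, 1))
representation = sparse.hstack([representation1, extra_column]).tocsr()
```

The result has one more column than representation1, with the ratio in the last column. This approach does not provide feature names on its own.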
I read about using FeatureUnion to combine two kinds of features in SKLearn, but I am not sure how to correctly use the NumPy array representation2 as a feature here. I tried:
union = sklearn.pipeline.make_union([representation1, representation2])
But now I can't use e.g. union.get_feature_names_out(), since it throws: AttributeError: Transformer list (type list) does not provide get_feature_names_out.
What did I understand incorrectly here?