Issue while inserting count vectorizer results to the dataframe

Question

I have a dataframe with shape (4237, 19) and another dataframe with shape (4237, 6). I need to combine the two column-wise, so the resulting dataframe should have shape (4237, 25), but I am getting (5524, 25). I cannot work out what is causing this.

The code I used:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

social_media_vectorizer = CountVectorizer(lowercase=True)

train_social_media_vector = social_media_vectorizer.fit_transform(x_train["social_media"].values.astype("U"))
test_social_media_vector = social_media_vectorizer.transform(x_test["social_media"].values.astype("U"))

print(x_train.shape)
print(x_test.shape)

train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
test_social_media_df = pd.DataFrame(test_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
x_train = pd.concat([x_train, train_social_media_df], axis=1)
x_test = pd.concat([x_test, test_social_media_df], axis=1)

print("="*100)
print(x_train.shape)
print(x_test.shape)

print("="*100)
print(social_media_vectorizer.vocabulary_)

Result

(4237, 19)
(1816, 19)
====================================================================================================
(5524, 25)
(3058, 25)
====================================================================================================
{'facebook': 0, 'linkedin': 2, 'twitter': 4, 'instagram': 1, 'youtube': 5, 'producthunt': 3}

Corralien · Accepted Answer · 2021-12-14 07:56:54Z

Are you sure the shape of train_social_media_vector.todense() is (4237, 6)? It seems to be (1287, 6), since 5524 - 4237 = 1287.

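Try passing ignore_index=True:

x_train = pd.concat([x_train, train_social_media_df], axis=1, ignore_index=True)
x_test = pd.concat([x_test, test_social_media_df], axis=1, ignore_index=True)

Note that with axis=1, ignore_index=True replaces the column labels with 0..n-1 and does not change how rows are aligned, so it only helps if the row indexes already match. The index-based fix in the comments is the one that addresses the row count.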
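To see what ignore_index=True does when axis=1, here is a minimal sketch with made-up frames. The names left and right are hypothetical stand-ins for x_train and train_social_media_df, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-ins: `left` keeps shuffled row labels, as a frame does
# after a shuffled train/test split; `right` gets a fresh 0..n-1 index.
left = pd.DataFrame({"a": [10, 20, 30]}, index=[5, 0, 7])
right = pd.DataFrame({"b": [1, 2, 3]})  # index 0, 1, 2

plain = pd.concat([left, right], axis=1)
ignored = pd.concat([left, right], axis=1, ignore_index=True)

print(plain.shape)            # (5, 2): rows are the union of both indexes
print(ignored.shape)          # (5, 2): row alignment is unchanged
print(list(ignored.columns))  # [0, 1]: only the column labels were reset
```

In this toy case the row count grows from 3 to 5 either way, which mirrors the jump from 4237 to 5524 in the question.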

edited Dec 14, 2021 at 7:56

answered Dec 14, 2021 at 7:50

Corralien

6 Comments

Satyam Anand Over a year ago

Yes, it is (4237, 6). If it had been (1287, 6), then 1287 rows would have been added to the test data as well, which would make it (3103, 25), right?

Corralien Over a year ago

What is the shape of your original dataframe (before train_test_split)?

Satyam Anand Over a year ago

It is (6053, 29). I have dropped some columns which are not useful.

Corralien Over a year ago

Also try replacing todense with toarray.

Corralien Over a year ago

OK, I think I understand. Try:

train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out(), index=x_train.index)

Do the same for test_social_media_df with index=x_test.index. Now you can concat without ignore_index.
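The index-reuse idea can be sketched on toy data as follows. This is an illustration under assumptions: the followers column, the two feature names, and the counts array are invented, and the array stands in for train_social_media_vector.toarray() so that the snippet runs without scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x_train after a shuffled split: labels 5, 0, 7.
x_train = pd.DataFrame({"followers": [120, 45, 300]}, index=[5, 0, 7])

# Stand-in for the vectorizer output: one row of counts per x_train row.
counts = np.array([[1, 0], [0, 1], [1, 1]])
feature_names = ["facebook", "twitter"]

# Reuse x_train's index so row i of the counts carries the label of row i.
counts_df = pd.DataFrame(counts, columns=feature_names, index=x_train.index)

combined = pd.concat([x_train, counts_df], axis=1)
print(combined.shape)               # (3, 3): no extra rows
print(combined.isna().sum().sum())  # 0: no NaN padding was introduced
```

Because both frames now share the same labels, pd.concat pairs each row with its own counts and the original labels are preserved for later joins.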

Brandon · Accepted Answer · 2021-12-14 07:56:38Z

Check the indexes of x_train and x_test before concatenating. I assume they differ from the indexes of the vectorizer dataframes. With axis=1, pd.concat aligns rows by index label, and any label missing from one frame is filled with NaN by default, which is where the extra rows come from. If you do not care about the indexes at all, drop them with .reset_index(drop=True) on both frames before concatenating. Note that ignore_index=True in pd.concat() applies to the concatenation axis, so with axis=1 it relabels the columns rather than the rows. See @Corralien's answer above.
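A quick way to run this check, and the reset_index route, is sketched below on made-up data. The frame contents are hypothetical, and counts_df stands in for train_social_media_df.

```python
import pandas as pd

# Hypothetical stand-ins: shuffled labels on one side, default labels on the other.
x_train = pd.DataFrame({"followers": [120, 45, 300]}, index=[5, 0, 7])
counts_df = pd.DataFrame({"facebook": [1, 0, 1]})  # index 0, 1, 2

# Diagnose: axis=1 concatenation only lines up cleanly if the indexes match.
print(x_train.index.equals(counts_df.index))  # False

# Labels found in only one of the two frames produce partly-NaN rows.
unmatched = x_train.index.symmetric_difference(counts_df.index)
print(len(unmatched))  # 4 (labels 1, 2, 5, 7)

# If the original labels are not needed, drop them on both sides first.
aligned = pd.concat(
    [x_train.reset_index(drop=True), counts_df.reset_index(drop=True)],
    axis=1,
)
print(aligned.shape)  # (3, 2)
```

Resetting the index discards the original row labels, so it fits best when nothing downstream needs to map rows back to the unsplit dataframe.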

answered Dec 14, 2021 at 7:56

Brandon
