I have a dataframe with shape (4237, 19) and another dataframe with shape (4237, 6). I need to combine them column-wise, so the resulting dataframe should have shape (4237, 25), but I am getting (5524, 25). I am not able to understand the issue.

Here is the code I used:

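import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

social_media_vectorizer = CountVectorizer(lowercase=True)

# Fit the vocabulary on the training text only, then reuse it for the test text
train_social_media_vector = social_media_vectorizer.fit_transform(x_train["social_media"].values.astype("U"))
test_social_media_vector = social_media_vectorizer.transform(x_test["social_media"].values.astype("U"))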

print(x_train.shape)
print(x_test.shape)

train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
test_social_media_df = pd.DataFrame(test_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
x_train = pd.concat([x_train, train_social_media_df], axis=1)
x_test = pd.concat([x_test, test_social_media_df], axis=1)

print("="*100)
print(x_train.shape)
print(x_test.shape)

print("="*100)
print(social_media_vectorizer.vocabulary_)

Result

(4237, 19)
(1816, 19)
====================================================================================================
(5524, 25)
(3058, 25)
====================================================================================================
{'facebook': 0, 'linkedin': 2, 'twitter': 4, 'instagram': 1, 'youtube': 5, 'producthunt': 3}

2 Answers

1

Are you sure the shape of train_social_media_vector.todense() is (4237, 6)? It seems to be (1287, 6).

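Try resetting the index of x_train and x_test before concatenating, so that both frames share the default 0..n-1 row labels. Passing ignore_index=True alone is not enough here: with axis=1 it only renumbers the columns, and rows are still aligned by index label.

x_train = pd.concat([x_train.reset_index(drop=True), train_social_media_df], axis=1)
x_test = pd.concat([x_test.reset_index(drop=True), test_social_media_df], axis=1)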

6 Comments

Yes, it is (4237, 6). If it were (1287, 6), then 1287 rows would have been added to the test data as well, which would make it (3103, 25), right?
What is the shape of your original dataframe (before train_test_split)?
It is (6053, 29). I have dropped some columns that are not useful.
Also try replacing todense with toarray.
OK, I think I understand now. Try: train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out(), index=x_train.index). Now you can concat without ignore_index.
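
The suggestion in the last comment can be illustrated with a minimal sketch. The toy frame and the `counts` array below are made up for illustration: they stand in for `x_train` after `train_test_split` and for the dense vectorizer output, and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x_train after train_test_split: the index is a
# shuffled subset of the original row labels, not 0..n-1.
x_train = pd.DataFrame({"feature": [10, 20, 30]}, index=[5, 0, 2])

# Hypothetical stand-in for the dense CountVectorizer output (3 rows, 2 terms).
counts = np.array([[1, 0], [0, 1], [1, 1]])
terms = ["facebook", "twitter"]

# A new DataFrame gets the default index 0, 1, 2. concat(axis=1) aligns rows
# by label and keeps the union of labels, so extra rows appear.
misaligned = pd.concat([x_train, pd.DataFrame(counts, columns=terms)], axis=1)

# Reusing x_train's index keeps the rows paired one-to-one.
counts_df = pd.DataFrame(counts, columns=terms, index=x_train.index)
aligned = pd.concat([x_train, counts_df], axis=1)

print(misaligned.shape)  # (4, 3): labels 0, 1, 2, 5
print(aligned.shape)     # (3, 3)
```

The same mechanism would explain the shapes in the question: the labels that exist in only one of the two frames become additional rows filled with NaN.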
0

Check the indexes of x_train and x_test before calling pd.concat. I assume they differ from the indexes of the new dataframes. With axis=1, rows are matched by index label, and any label present in only one frame produces an extra row filled with NaN by default. If you do not care about the index values at all, drop them with .reset_index(drop=True) on both frames before concatenating. Note that ignore_index=True applies to the concatenation axis, so with axis=1 it only renumbers the columns and does not prevent row alignment. See also @Corralien's answer above.
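
A small sketch of these checks, using two hypothetical toy frames (`left` and `right` are illustrative names, not from the question):

```python
import pandas as pd

# Hypothetical frames: same number of rows, different index labels.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[7, 3, 9])
right = pd.DataFrame({"b": [4, 5, 6]})  # default index 0, 1, 2

# Quick check before concatenating: do the two indexes match?
same_index = left.index.equals(right.index)
print(same_index)  # False

# No shared labels, so the outer join yields 6 rows with NaNs.
joined = pd.concat([left, right], axis=1)
print(joined.shape)  # (6, 2)

# With axis=1, ignore_index=True only relabels the columns as 0, 1;
# the rows are still aligned by label.
relabeled = pd.concat([left, right], axis=1, ignore_index=True)
print(relabeled.shape)  # (6, 2)

# Dropping both indexes pairs the rows by position instead.
fixed = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(fixed.shape)  # (3, 2)
```

Resetting the index discards the original row labels, so this is only appropriate when those labels are not needed later, for example to look rows up in the original dataframe.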
