How do I resolve a vectorizer dimension mismatch?

I am using TfidfVectorizer to vectorize text, but I get a dimension mismatch when I try to compute cosine_similarity.

My code looks like this:

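import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_text(text):
    return re.sub(r'[^a-zA-Z0-9 ]', "", text)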

movies['title'] = movies['title'].apply(clean_text)

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

title_vec = vectorizer.fit_transform(movies['title'])

title = "Toy Story"

title = clean_text(title)

word_vec = vectorizer.transform([title])

similarity = cosine_similarity(word_vec, title_vec)

which results in error message:

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 172412 while Y.shape[1] == 156967
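For illustration, here is a minimal sketch of one way this kind of error can arise. It is only a guess at the cause: it assumes the same vectorizer object was fitted again on different text (for example, by re-running a notebook cell) after `title_vec` was computed. The tiny title lists are made up and stand in for the real `movies` data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()

# Hypothetical data: fit on one corpus and keep the resulting matrix
title_vec = vectorizer.fit_transform(["toy story", "jumanji"])

# Fitting the same object again on other text replaces its vocabulary,
# so later transform() calls produce a different number of columns
vectorizer.fit(["toy story", "jumanji", "grumpier old men", "heat"])
word_vec = vectorizer.transform(["toy story"])

print(word_vec.shape[1], title_vec.shape[1])  # the two widths differ

try:
    cosine_similarity(word_vec, title_vec)
    raised = False
except ValueError as exc:
    # Same kind of "Incompatible dimension" error as in the question
    raised = True
    print(exc)
```

If this is what happened, re-creating `title_vec` with the currently fitted vectorizer, or fitting only once, should make the widths agree again.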

PS: I have checked the lengths of word_vec and title_vec, and they differ. I set ngram_range=(1,1) in the vectorizer, but that did not help. I also tried CountVectorizer(), but the issue remains.
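The check described above can be sketched as follows, comparing the column count of each matrix with the size of the fitted vocabulary. The list `titles` is a made-up stand-in for `movies['title']`. When both matrices come from the same fitted vectorizer, the three numbers should agree; if they do not, that would suggest the two matrices were produced from different fits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for movies['title']
titles = ["Toy Story 1995", "Jumanji 1995", "Toy Soldiers 1991"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
title_vec = vectorizer.fit_transform(titles)    # fit once on the corpus
word_vec = vectorizer.transform(["Toy Story"])  # reuse the same fitted object

# Number of features learned by the fit
n_features = len(vectorizer.vocabulary_)
print(title_vec.shape, word_vec.shape, n_features)

# All three widths must match for cosine_similarity to work
same_width = title_vec.shape[1] == word_vec.shape[1] == n_features
similarity = cosine_similarity(word_vec, title_vec)
```

Note that the comparison uses `.shape[1]` (the number of columns) rather than `len()`, since the column count is what `cosine_similarity` checks.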

I was out of options, and ChatGPT suggested a solution that did not solve the problem:

import numpy as np
from scipy.sparse import hstack

# Pad smaller matrix with zeros

if word_vec.shape[1] > title_vec.shape[1]:
    diff = word_vec.shape[1] - title_vec.shape[1]
    title_vec = hstack([title_vec, np.zeros((title_vec.shape[0], diff))])
elif title_vec.shape[1] > word_vec.shape[1]:
    diff = title_vec.shape[1] - word_vec.shape[1]
    word_vec = hstack([word_vec, np.zeros((word_vec.shape[0], diff))])

So I could not use the code above, but I am including it here to show the extent of the problem.

edited Nov 28, 2024 at 13:02

desertnaut


asked Nov 27, 2024 at 18:38

KIZ-MAN

