I am using TfidfVectorizer to vectorize text, but I get a dimension mismatch when I try to compute cosine_similarity.

My situation looks like this:

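import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_text(text):
    # Remove everything except letters, digits and spaces
    return re.sub(r'[^a-zA-Z0-9 ]', "", text)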

movies['title'] = movies['title'].apply(clean_text)

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

title_vec = vectorizer.fit_transform(movies['title'])

title = "Toy Story"

title = clean_text(title)

word_vec = vectorizer.transform([title])

similarity = cosine_similarity(word_vec, title_vec)

This results in the error message:

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 172412 while Y.shape[1] == 156967

PS: I have checked the shapes of word_vec and title_vec, and their column counts differ. I set ngram_range=(1,1) in the vectorizer, but the error persisted. I also tried CountVectorizer(), but the issue remains.

I was out of options, and ChatGPT suggested the following, which did not solve the problem:

import numpy as np
from scipy.sparse import hstack

# Pad the smaller matrix with zeros

if word_vec.shape[1] > title_vec.shape[1]:
    diff = word_vec.shape[1] - title_vec.shape[1]
    title_vec = hstack([title_vec, np.zeros((title_vec.shape[0], diff))])
elif title_vec.shape[1] > word_vec.shape[1]:
    diff = title_vec.shape[1] - word_vec.shape[1]
    word_vec = hstack([word_vec, np.zeros((word_vec.shape[0], diff))])

I could not use the code above, but I am including it here to show what I have already tried.
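For reference, here is a stripped-down, self-contained version of the pipeline. It is only a sketch: the three-row `movies` DataFrame is made up for illustration and stands in for my real data. In this version both matrices come from the same fitted vectorizer, so their column counts agree and cosine_similarity runs. That suggests the mismatch in my session may come from state not shown here, for example the vectorizer being re-fit between the two calls, though I have not confirmed this.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_text(text):
    # Remove everything except letters, digits and spaces
    return re.sub(r"[^a-zA-Z0-9 ]", "", text)


# Hypothetical stand-in for the real movies DataFrame (assumption for illustration)
movies = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Jumanji (1995)", "Toy Story 2 (1999)"]}
)
movies["title"] = movies["title"].apply(clean_text)

# Fit once, then reuse the same fitted object for the query
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
title_vec = vectorizer.fit_transform(movies["title"])
word_vec = vectorizer.transform([clean_text("Toy Story")])

# Both matrices share the fitted vocabulary, so the column counts match
n_features = len(vectorizer.vocabulary_)
print(word_vec.shape[1], title_vec.shape[1], n_features)

# One row of similarities: the query against every title
similarity = cosine_similarity(word_vec, title_vec)
print(similarity.shape)
```

Comparing `word_vec.shape[1]` and `title_vec.shape[1]` against `len(vectorizer.vocabulary_)` right before the cosine_similarity call is a quick way to see which of the two matrices was built from a different fit.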
