Does CountVectorizer().fit_transform() preserve order of input?

Question

I'm wondering if, when I use CountVectorizer().fit_transform(), the output preserves the order of the input.

My input is a list of documents. I know that the output matches the input in terms of the length, but I'm not sure if they are ordered the same way.

I understand that I might not be explaining it very well, so here's an example.

Say I have:

from sklearn.feature_extraction.text import CountVectorizer

input = ["<text_1>", "<text_2>", "<text_3>"]
a = CountVectorizer().fit_transform(input)

Will the indexes correspond as though order is preserved?

For example, in:

  (0, 33)   1
...
  (0, 42)   8
...
  (385, 58) 1
  (385, 51) 6

Does (0, 33) 1 correspond to input[0], and (385, 58) 1 to input[385]?

When you say "order", do you mean the order of documents or the order of words?
– fshabashev, Commented May 3, 2022 at 12:02
Order of documents. I've tried to clarify in my main question.
– rookinn, Commented May 3, 2022 at 12:02

Arne · Accepted Answer · 2022-05-03 12:08:56Z

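Yes, the row order is preserved: row i of the output corresponds to input[i]. In the printed sparse matrix, the first number of each (row, column) pair is the index of the document in the input list, and the second is the index of the term in the vocabulary. Preserving row order is a general expectation for scikit-learn transformation methods, because a common workflow is to split your data into a feature matrix X and a target vector y, where each row of the matrix corresponds to one element of the vector. When you transform X, you must still be able to train the model on the transformed X paired with y, so the order must be preserved.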
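One way to check this yourself is a small sketch like the one below. The three toy documents and the names docs, vocab and row_counts are made up for illustration and are not from the question. It rebuilds the word counts of each output row and lets you compare them with the document at the same position in the input list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only); each has a distinct word profile.
docs = ["apple apple banana", "cherry", "banana cherry cherry"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_  # maps each token to its column index

# Rebuild the non-zero counts of every row as a {token: count} dict.
row_counts = [
    {tok: int(X[i, col]) for tok, col in vocab.items() if X[i, col] > 0}
    for i in range(X.shape[0])
]

# Row i should describe docs[i] if the input order is preserved.
for doc, counts in zip(docs, row_counts):
    print(repr(doc), "->", counts)
```

If the rows came back in a different order, the counts printed next to each document would not match its words.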
