Newest 'countvectorizer' Questions

0 votes

0 answers

26 views

Configuring 'char_wb' for CountVectorsFeaturizer in RASA

I have seen this configuration in the default RASA chatbot. Why do we need this setting for featurization, and for which specific languages? If I build a chatbot for non-English conversations, for example ...

Minh Khuất Đức

1

asked Dec 5, 2024 at 12:49

1 vote

0 answers

50 views

How do I resolve a vectorizer mismatch?

I am using TfidfVectorizer to vectorize text, but I get a dimension mismatch when I try to compute cosine_similarity. My situation looks like this: first, def clean_text(text): return re....

KIZ-MAN

33

asked Nov 27, 2024 at 18:38

0 votes

1 answer

185 views

How to identify feature names from indices in a decision tree using scikit-learn’s CountVectorizer?

I have the following data for training a model to detect whether a sentence is (a) about a cat or dog, or (b) not about a cat or dog. I ran the following code to train a DecisionTreeClassifier() model then view ...

code_to_joy

631

asked Mar 11, 2024 at 10:24

2 votes

1 answer

146 views

Memory Issue: Creating Bigrams and Trigrams with CountVectorizer

I am trying to create a document term matrix using CountVectorizer to extract bigrams and trigrams from a corpus. from sklearn.feature_extraction.text import CountVectorizer lemmatized = dat_clean['...

Kaitlin

83

asked Sep 6, 2023 at 13:57

0 votes

1 answer

145 views

Where does the model pipeline get the data for the bag of words features from?

I am quite new to deep learning but I am working on this little binary text classification experiment. I want to investigate what impact the training data size has on the metrics of the model (does a ...

clowny

1

asked Apr 25, 2023 at 20:35

1 vote

1 answer

307 views

CountVectorizer for number

I have a list of numbers and I want to use CountVectorizer from sklearn.feature_extraction.text import CountVectorizer def x(n): return str(n) sentences = [5,10,15,10,5,10] vectorizer = ...

saraafr

143

asked Apr 12, 2023 at 7:53

0 votes

2 answers

1k views

AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names' -- Topic Modeling -- Latent Dirichlet Allocation

I'm trying to follow the example from the link below. https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe All the code up to this point ...

ASH

20.5k

asked Apr 2, 2023 at 2:21

0 votes

1 answer

178 views

Half-space (\u200c) is not supported in CountVectorizer

With CountVectorizer in Python, I want Persian words that contain a half-space to be one token, not two words. I would be grateful for any guidance. Thank you. I used "درخت‌های زیبا" in ...

Sedghian

1

asked Feb 4, 2023 at 20:51

2 votes

2 answers

161 views

Why does SMOTE raise "Found input variables with inconsistent numbers of samples"?

I am trying to classify emotion from tweets with a dataset of 4401 tweets. When I use a smaller sample of the data (around 15 tweets) everything works fine, but when I use the full dataset it raises the error of ...

life_student

21

asked Jan 26, 2023 at 22:30

1 vote

0 answers

98 views

CountVectorizer not working in ColumnTransformer

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case: from sklearn.compose import ColumnTransformer from sklearn.feature_extraction.text import ...

Chukwudi

324

asked Jan 4, 2023 at 17:23

1 vote

0 answers

39 views

Create a token pattern based on the concatenation of some given words [duplicate]

I have a list of keywords, for example: keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', ...] My goal is to define the parameter token_pattern of CountVectorizer by concatenating all keywords....

LJG

787

asked Dec 7, 2022 at 13:38

0 votes

1 answer

290 views

Custom tokenizer not working in sklearn CountVectorizer

I am trying to make a CountVectorizer with a custom tokenizer function. I am facing a weird problem with it. In the code below, temp_tok is a list of 5 values which is used as the vocabulary later. temp_tok = [...

Anuj Chopra

1

asked Dec 6, 2022 at 4:43

1 vote

0 answers

149 views

Combine numpy array with TfidfVectorizer as a joint feature matrix in SKLearn

I have a dataset input, which is a list of ~40000 letters (that are represented as strings). With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation1: import numpy as np ...

TiMauzi

236

asked Dec 5, 2022 at 23:42

0 votes

1 answer

244 views

Remove features with whitespace in sklearn CountVectorizer with char_wb

I am trying to build char-level ngrams using sklearn's CountVectorizer. When using analyzer='char_wb' the vocab has features with whitespace around them. I want to exclude the features/words with ...

Ankit Bansal

475

asked Dec 1, 2022 at 8:16

1 vote

1 answer

84 views

Retain original document element index of argument passed through sklearn's CountVectorizer() in order to access corresponding part of speech tag

I have a data frame with sentences and the respective part of speech tag for each word. Below is an extract of the data I'm working with (data taken from the SNLI corpus). For each sentence in my ...

OLGJ

472

asked Nov 29, 2022 at 8:39

0 votes

1 answer

664 views

How do you get the frequency of the terms generated by tfidf.get_feature_names_out()?

After fitting with tfidf, I'm looking at the features that were generated: from sklearn.feature_extraction.text import TfidfVectorizer corpus = [ 'This is the first document.', 'This document ...

james pow

356

asked Nov 28, 2022 at 22:01

0 votes

1 answer

127 views

Incompatible row dimensions when using passthrough in GridSearch over sklearn Pipeline with FeatureUnion

I am trying to do grid search over a sklearn pipeline that uses a custom transformer in a pipeline with FeatureUnion. It works fine when the pipeline uses the custom transformer class in FeatureUnion; ...

MichaelU

125

asked Nov 7, 2022 at 11:21

1 vote

0 answers

118 views

Unresolved attribute reference 'transform' for class 'object'

I'm having a problem with this CV code. It was working perfectly 4-5 months ago. However, I'm getting an error now: Unresolved attribute reference 'transform' for class 'object'. Does anyone have ...

Jeff

11

asked Nov 4, 2022 at 21:25

1 vote

0 answers

211 views

Multiclass classification using multiple columns as input

I am using LinearSVC for prediction. I want to use two columns to predict the class of an item. I have written code that uses only one column; how do I include two columns? labels_T = df_T['...

FRECEENA FRANCIS

49

asked Oct 13, 2022 at 6:36

2 votes

0 answers

736 views

How can I use LGBM early stopping rounds parameters in a Scikit-Learn pipeline?

I've been trying this for a while, but still couldn't figure out the solution. I have a pipeline with a few steps and a LGBM classifier, which I want to use with early_stopping_round parameter. ...

dsbr__0

291

asked Oct 3, 2022 at 12:41

1 vote

0 answers

41 views

Confusion regarding CountVectorizer

Why do I have to apply CountVectorizer on a smaller sample and then make the data frame? Why can't I apply a count vectorizer to a large sample and create a data frame out of it? Here is my code: ...

MUNIM BIN MUQUITH

21

asked Aug 28, 2022 at 13:18

1 vote

0 answers

147 views

Column not iterable, PySpark

I'm trying to perform count vectorization using this function I've created. However, I keep getting an error stating "column not iterable", and I cannot figure out why and how to ...

anon_e

21

asked Aug 4, 2022 at 12:05

1 vote

1 answer

238 views

How to create row wise CSV for vectorized dataframe?

What I am trying to do is basically pulling out keywords from a processed file of a log file and creating a vectorized dataframe of those keywords. But when I am writing that dataframe into CSV, words ...

Ujjawal Pandey

37

asked Jul 10, 2022 at 12:09

2 votes

1 answer

1k views

How to group-by and get most frequent ngram?

My dataframe looks like this: ID topics text 1 1 twitter is my favorite social media 2 1 favorite social media 3 2 rt twitter tomorrow 4 3 rt facebook ...

CPDatascience

103

asked Jun 30, 2022 at 4:10

0 votes

1 answer

312 views

How can I vectorize a list of words?

I am working on SMS data where one column of my dataframe holds a list of words. I want to train a classifier to predict its type and subtype. How would I convert the words into numerical format ...

aashish

11

asked Jun 11, 2022 at 6:06

1 vote

1 answer

467 views

Get topN keywords with PySpark CountVectorizer

I want to extract keywords using pyspark.ml.feature.CountVectorizer. My input Spark dataframe looks as follows: id text 1 sun, mars, solar system, solar system, mars, solar system, venus, solar ...

red_quark

1,001

asked Jun 6, 2022 at 20:58

0 votes

1 answer

56 views

sklearn CountVectorizer: results are shuffled

Here I am using CountVectorizer on some text. The result is one where counts don't match with words. For example, in index 0, "rock" should have a count of 3; instead it shows 2 and "here...

Omar Naguib

1

asked May 31, 2022 at 0:54

2 votes

0 answers

376 views

TfidfVectorizer on only one column in training set

I have a problem concerning the TfidfVectorizer. I have 3 columns: one is the text that needs to be vectorized and the other two are already numbers, so I only need to vectorize ...

Christian Holm

21

asked May 17, 2022 at 12:59

3 votes

1 answer

541 views

Neither stemmer nor lemmatizer seem to work very well, what should I do?

I am new to text analysis and am trying to create a bag of words model (using sklearn's CountVectorizer method). I have a data frame with a column of text with words like 'acid', 'acidic', 'acidity', '...

Rebecca James

395

asked May 16, 2022 at 19:59

0 votes

2 answers

424 views

NumPy - array of arrays recognized as vector

I encounter a problem with numpy arrays. I used CountVectorizer from sklearn with a wordset and values (from pandas column) to create an array of arrays that count words (BoW). And when I print the ...

Alexandre Juan

3

asked May 16, 2022 at 9:31

0 votes

1 answer

794 views

Does CountVectorizer().fit_transform() preserve order of input?

I'm wondering if, when I use CountVectorizer().fit_transform(), the output preserves the order of the input. My input is a list of documents. I know that the output matches the input in terms of the ...

rookinn

3

asked May 3, 2022 at 11:50

1 vote

1 answer

102 views

Python CountVectorizer(): why do we have to assign CountVectorizer() to a variable in order for this to work?

I took this example from the SKLearn website. Here's the initial code: from sklearn.feature_extraction.text import CountVectorizer corpus = ['This is the first document.', 'This document is ...

Felipe Queiroz

123

asked Apr 27, 2022 at 14:42

0 votes

2 answers

975 views

Dataframe .join creates NaN valued column from actual values

What I want to do is create a bag of words for 11410 strings and then append at the end of the word columns the result that I have stored in another dataframe. I have a dataframe with the column '...

user1512676872

17

asked Apr 18, 2022 at 18:59

1 vote

2 answers

1k views

Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack ...

DJL

145

asked Apr 9, 2022 at 6:18

0 votes

1 answer

300 views

PySpark: Can't pickle CountVectorizerModel - TypeError: Cannot serialize socket object (but why is the socket library being used?)

I noticed that, unlike in scikit-learn, the PySpark implementation of CountVectorizer uses the socket library and so I'm unable to pickle it. Is there any way around this or another way to persist ...

pear

131

asked Mar 24, 2022 at 3:28

-1 votes

1 answer

3k views

ValueError: X has 5 features, but RandomForestClassifier is expecting 2607 features as input

This is how I am converting text to count vectors: cv1 = CountVectorizer() x_traincv = cv1.fit_transform(x_train) a = x_traincv.toarray() a. This is the model I am using for prediction: from sklearn.ensemble import ...

Pratik barahatte

1

asked Mar 7, 2022 at 18:42

0 votes

1 answer

78 views

From a bunch of n vectors, get all vectors which are mutually orthogonal

Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (without considering the words in a pre-defined list of stop words) Approach that I ...

Abdur Rahman

1

asked Jan 27, 2022 at 3:39

1 vote

1 answer

1k views

How to get bag of words and term frequency in text format using Sklearn?

I would like to print out the list of words (i.e., bag of words) for each document in a corpus and their respective term frequency (in text format), using Sklearn's CountVectorizer. How could I ...

FAISAL BARGI

30

asked Jan 15, 2022 at 8:33

1 vote

2 answers

4k views

My Naive Bayes classifier works for my model but will not accept user input on my application

I am trying to deploy my machine learning Naive Bayes sentiment analysis model onto a web application. The idea is that the user should type some text, which the application performs sentiment ...

LilRi

69

asked Dec 23, 2021 at 11:46

0 votes

1 answer

401 views

CountVectorizer does not process my text data. It keeps giving me AttributeError: 'list' object has no attribute 'lower'

I have created process_textData function that takes in a pandas DataFrame column of text, then performs the following: 1. Convert text to lower case and remove all punctuation 2. Optionally apply ...

Yordan Иванов

13

asked Dec 14, 2021 at 13:26

1 vote

2 answers

119 views

Issue while inserting count vectorizer results to the dataframe

I have a dataframe with shape (4237, 19) and then other dataframe with the shape (4237, 6), I need to combine both these dataframes column wise, so technically resultant dataframe should be of the ...

Satyam Anand

489

asked Dec 14, 2021 at 7:34

0 votes

1 answer

1k views

Transforming sentences to Numbers using SciKit-Learn’s CountVectorizer()

I am trying to convert an input sentence Review into a CountVectorizer. I am struggling to handle the sentences that are passed through. How do I deal with the sentences and add vectors to these? Any ...

N K

167

asked Dec 5, 2021 at 19:47

0 votes

1 answer

247 views

countvectorizer not able to detect , words

final_vocab = {'Amazon', 'Big Bazaar', 'Brand Factory', 'Central', 'Cleartrip', 'Dominos', 'Flipkart', 'IRCTC', 'Lenskart', 'Lifestyle', 'MAX', 'MMT', 'More', 'Myntra'} vect = CountVectorizer(...

qaiser

2,908

asked Nov 27, 2021 at 8:07

0 votes

1 answer

1k views

Fit CountVectorizer on training and test data to not miss any words

I have a training data set for which I know the labels for the classification and a test data set where I do not have the labels. Now, I want to fit the Vectorizer to the union of the training and ...

ghxk

33

asked Nov 14, 2021 at 18:09

0 votes

1 answer

536 views

CountVectorizer() in Python is returning all zeros when I pass a custom vocabulary list

from sklearn.feature_extraction.text import CountVectorizer foo = ["the Cat is :", "is smart now"] cv = CountVectorizer(vocabulary = foo) new_list =["the Cat is : the most...

Raed

31

asked Nov 12, 2021 at 3:27

0 votes

0 answers

54 views

Encoding error with CSV file not allowing me to vectorize the data

I have a .csv file that has the format below: I am using pandas to read it and then encoding it using utf-8 but it looks like pandas isn't splitting the columns "Sentence" and "Label...

need_help12

21

asked Nov 1, 2021 at 16:30

0 votes

0 answers

254 views

scikit-learn CountVectorizer fit time taken or time estimate

I have around 100k texts in a numpy array. Is it possible to see the time remaining to complete the fit or at least see the amount of fit that has happened (like tqdm for for-loops)? from sklearn....

Anirudh

25

asked Oct 10, 2021 at 7:46

2 votes

1 answer

1k views

'CountVectorizer' object has no attribute 'toarray'

I'm trying to vectorize some tweets so I can put them in a list and use them in a classifier. But there is a problem turning the result into a DataFrame.

bo_

81

asked Sep 28, 2021 at 9:54

1 vote

2 answers

371 views

Python: ValueError on CountVectorizer. The truth value of a Series is ambiguous

I have this dataset and I'm trying to make Bag of Words out of it using sklearn CountVectorizer, but it throws me this error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool()...

Abbi KRK

53

asked Sep 14, 2021 at 6:55

2 votes

1 answer

402 views

How to prioritize certain features with max_features parameter in CountVectorizer

I have a working program but I realized that some important n-grams in the test data were not a part of the 6500 max_features I had allowed in the training data. Is it possible to add a feature like ...

user16895885

21

asked Sep 13, 2021 at 3:39
