337 questions
0
votes
0
answers
26
views
Configuring 'char_wb' for CountVectorsFeaturizer in RASA
I have seen this configuration in the default RASA chatbot. Why do we need this setting for featurization, and for which specific languages? If I build a chatbot for non-English conversations, for example ...
1
vote
0
answers
50
views
How do I resolve a vectorizer mismatch?
I am using TfidfVectorizer as a text vectorizer, but I am experiencing a dimension mismatch when I try to obtain cosine_similarity.
My situation looks like this:
First,
def clean_text(text):
return re....
0
votes
1
answer
185
views
How to identify feature names from indices in a decision tree using scikit-learn’s CountVectorizer?
I have the following data for training a model to detect whether a sentence is about:
a cat or dog
NOT about a cat or dog
I ran the following code to train a DecisionTreeClassifier() model then view ...
2
votes
1
answer
146
views
Memory Issue: Creating Bigrams and Trigrams with CountVectorizer
I am trying to create a document term matrix using CountVectorizer to extract bigrams and trigrams from a corpus.
from sklearn.feature_extraction.text import CountVectorizer
lemmatized = dat_clean['...
0
votes
1
answer
145
views
Where does the model pipeline get the data for the bag of words features from?
I am quite new to deep learning but I am working on this little binary text classification experiment. I want to investigate what impact the training data size has on the metrics of the model (does a ...
1
vote
1
answer
307
views
CountVectorizer for numbers
I have a list of numbers and I want to use CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
def x(n):
return str(n)
sentences = [5,10,15,10,5,10]
vectorizer = ...
0
votes
2
answers
1k
views
AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names' -- Topic Modeling -- Latent Dirichlet Allocation
I'm trying to follow the example from the link below.
https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe
All the code up to this point ...
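A hedged note on the error in the title: `get_feature_names` was deprecated in scikit-learn 1.0 and removed in 1.2, with `get_feature_names_out` as its replacement. The sketch below assumes that version change is the cause here. The toy corpus is made up for illustration and is not taken from the linked tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog barked"]

vectorizer = CountVectorizer()
vectorizer.fit(docs)

# Prefer the newer method; fall back for scikit-learn versions before 1.0.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = [str(name) for name in vectorizer.get_feature_names_out()]
else:
    feature_names = [str(name) for name in vectorizer.get_feature_names()]

print(feature_names)
```

The `hasattr` check keeps the same script usable on both older and newer installations; on a recent scikit-learn, calling `get_feature_names_out()` directly is enough.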
0
votes
1
answer
178
views
Half-space (\u200c) is not supported in CountVectorizer
In Python's CountVectorizer, I want Persian words that contain a half-space to be treated as one token, not two words.
I would be grateful for any guidance.
Thank you.
i used "درختهای زیبا" in ...
2
votes
2
answers
161
views
Why does SMOTE raise "Found input variables with inconsistent numbers of samples"?
I am trying to classify emotions from tweets with a dataset of 4401 tweets. When I use a smaller sample of the data (around 15 tweets) everything works fine, but when I use the full dataset it raises the error
...
1
vote
0
answers
98
views
CountVectorizer not working in ColumnTransformer
Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import ...
1
vote
0
answers
39
views
Create a token pattern based on the concatenation of some given words [duplicate]
I have a list of keywords, for example:
keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', ...]
My goal is to define the parameter token_pattern of CountVectorizer by concatenating all keywords....
0
votes
1
answer
290
views
Custom tokenizer not working in sklearn CountVectorizer
I am trying to make a CountVectorizer with a custom tokenizer function, and I am facing a weird problem with it. In the code below, temp_tok is a list of 5 values which is used as the vocabulary later.
temp_tok = [...
1
vote
0
answers
149
views
Combine numpy array with TfidfVectorizer as a joint feature matrix in SKLearn
I have a dataset input, which is a list of ~40000 letters (that are represented as strings).
With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation:
import numpy as np
...
0
votes
1
answer
244
views
Remove features with whitespace in sklearn CountVectorizer with char_wb
I am trying to build character-level n-grams using sklearn's CountVectorizer.
When using analyzer='char_wb', the vocabulary has features with whitespace around them. I want to exclude the features/words with ...
1
vote
1
answer
84
views
Retain original document element index of argument passed through sklearn's CountVectorizer() in order to access corresponding part of speech tag
I have a data frame with sentences and the respective part-of-speech tag for each word. Below is an extract of the data I'm working with (taken from the SNLI corpus). For each sentence in my ...
0
votes
1
answer
664
views
How do you get the frequency of the terms generated by tfidf.get_feature_names_out()?
After fitting with tfidf, I'm looking at the features that were generated:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document ...
0
votes
1
answer
127
views
Incompatible row dimensions when using passthrough in GridSearch over sklearn Pipeline with FeatureUnion
I am trying to do grid search over a sklearn pipeline that uses a custom transformer in a pipeline with FeatureUnion. It works fine when the pipeline uses the custom transformer class in FeatureUnion; ...
1
vote
0
answers
118
views
Unresolved attribute reference 'transform' for class 'object'
I'm having a problem with this CV code. It was working perfectly 4-5 months ago. However, I'm getting an error now: **Unresolved attribute reference 'transform' for class 'object'**
Does anyone have ...
1
vote
0
answers
211
views
Multiclass classification using multiple columns as input
I am using LinearSVC for prediction. I want to use two columns to predict the class of an item. I have written code that uses only one column; how do I include two columns?
labels_T = df_T['...
2
votes
0
answers
736
views
How can I use LGBM early stopping rounds parameters in a Scikit-Learn pipeline?
I've been trying this for a while, but still couldn't figure out the solution.
I have a pipeline with a few steps and a LGBM classifier, which I want to use with early_stopping_round parameter. ...
1
vote
0
answers
41
views
Confusion regarding CountVectorizer
Why do I have to apply CountVectorizer to a smaller sample and then make the data frame? Why can't I apply CountVectorizer to a large sample and create a data frame out of it?
Here is my code:
...
1
vote
0
answers
147
views
Column not iterable, PySpark
I'm trying to perform count vectorization using a function I've created; however, I keep getting an error stating "column not iterable", and I cannot figure out why or how to ...
1
vote
1
answer
238
views
How to create row wise CSV for vectorized dataframe?
What I am trying to do is basically pull out keywords from a processed version of a log file and create a vectorized dataframe of those keywords. But when I write that dataframe to CSV, words ...
2
votes
1
answer
1k
views
How to group-by and get most frequent ngram?
My dataframe looks like this:
ID topics text
1 1 twitter is my favorite social media
2 1 favorite social media
3 2 rt twitter tomorrow
4 3 rt facebook ...
0
votes
1
answer
312
views
How can I vectorize a list of words?
I am working on SMS data where I have a list of words in one column of my dataframe.
I want to train a classifier to predict its type and subtype.
How would I convert the words into numerical format ...
1
vote
1
answer
467
views
Get topN keywords with PySpark CountVectorizer
I want to extract keywords using pyspark.ml.feature.CountVectorizer.
My input Spark dataframe looks as follows:
id  text
1   sun, mars, solar system, solar system, mars, solar system, venus, solar ...
0
votes
1
answer
56
views
sklearn CountVectorizer: results are shuffled
Here I am using CountVectorizer on some text.
The result is one where the counts don't match the words;
for example, at index 0, "rock" should have a count of 3 but instead it shows 2, and "here...
2
votes
0
answers
376
views
TfidfVectorizer on only one column in training set
I have a problem concerning the TfidfVectorizer.
My problem is that I have 3 columns: one is the text that needs to be vectorized and the other two are already numeric, so I only need to vectorize ...
3
votes
1
answer
541
views
Neither stemmer nor lemmatizer seem to work very well, what should I do?
I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer). I have a data frame with a column of text with words like 'acid', 'acidic', 'acidity', '...
0
votes
2
answers
424
views
NumPy - array of arrays recognized as a vector
I encounter a problem with numpy arrays.
I used CountVectorizer from sklearn with a wordset and values (from pandas column) to create an array of arrays that count words (BoW). And when I print the ...
0
votes
1
answer
794
views
Does CountVectorizer().fit_transform() preserve order of input?
I'm wondering if, when I use CountVectorizer().fit_transform(), the output preserves the order of the input.
My input is a list of documents. I know that the output matches the input in terms of the ...
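As a minimal sketch of the behaviour being asked about: row `i` of the matrix returned by `fit_transform()` holds the counts for document `i` of the input, while the columns follow the sorted vocabulary. The three documents below are hypothetical and only illustrate the check.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; the position in this list is the row index in X.
docs = ["apple banana", "banana banana cherry", "cherry"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Look up the column for one term, then read its count in every row.
banana_col = vectorizer.vocabulary_["banana"]
counts = X.toarray()
banana_counts = [int(counts[i, banana_col]) for i in range(len(docs))]

print(banana_counts)  # one count per input document, in input order
```

Comparing `banana_counts` with the documents by eye (one, two, then zero occurrences of "banana") is a quick way to confirm the row alignment on your own data.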
1
vote
1
answer
102
views
Python CountVectorizer(): why do we have to assign CountVectorizer() to a variable in order for this to work?
I took this example from the SKLearn website. Here's the initial code:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
'This document is ...
0
votes
2
answers
975
views
Dataframe .join creates NaN valued column from actual values
What I want to do is create a bag of words for 11410 strings and then append at the end of the word columns the result that I have stored in another dataframe. I have a dataframe with the column '...
1
vote
2
answers
1k
views
Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack ...
0
votes
1
answer
300
views
PySpark: Can't pickle CountVectorizerModel - TypeError: Cannot serialize socket object (but why is the socket library being used?)
I noticed that, unlike in scikit-learn, the PySpark implementation of CountVectorizer uses the socket library, and so I'm unable to pickle it.
Is there any way around this or another way to persist ...
-1
votes
1
answer
3k
views
ValueError: X has 5 features, but RandomForestClassifier is expecting 2607 features as input
This is how I am converting text to a count vector.
cv1 = CountVectorizer()
x_traincv=cv1.fit_transform(x_train)
a = x_traincv.toarray()
a
This is the model I am using for prediction.
from sklearn.ensemble import ...
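A common cause of this kind of feature-count error is fitting a second vectorizer (or calling `fit_transform` again) on the text to be predicted, which builds a different vocabulary. The sketch below assumes that is what happened here; the training texts, labels, and `new_text` are invented for illustration, and only `cv1`, `x_train`, and `x_traincv` mirror names from the excerpt.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training data and labels.
x_train = ["good movie", "bad movie", "great film", "terrible film"]
y_train = [1, 0, 1, 0]
new_text = ["good film"]  # hypothetical input to classify

cv1 = CountVectorizer()
x_traincv = cv1.fit_transform(x_train)  # vocabulary is learned here, once

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(x_traincv, y_train)

# Reuse the fitted vectorizer with transform(), not fit_transform(),
# so the new matrix has the same columns the model was trained on.
x_newcv = cv1.transform(new_text)
prediction = model.predict(x_newcv)

print(x_traincv.shape[1], x_newcv.shape[1])
```

When deploying, the fitted vectorizer has to be saved and loaded together with the model so that both sides share one vocabulary.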
0
votes
1
answer
78
views
From a bunch of n vectors, get all vectors which are mutually orthogonal
Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (without considering the words in a pre-defined list of stop words)
Approach that I ...
1
vote
1
answer
1k
views
How to get bag of words and term frequency in text format using Sklearn?
I would like to print out the list of words (i.e., bag of words) for each document in a corpus and their respective term frequency (in text format), using Sklearn's CountVectorizer. How could I ...
1
vote
2
answers
4k
views
My Naive Bayes classifier works for my model but will not accept user input on my application
I am trying to deploy my machine learning Naive Bayes sentiment analysis model onto a web application. The idea is that the user should type some text, which the application performs sentiment ...
0
votes
1
answer
401
views
CountVectorizer does not process my text data. It keeps giving me AttributeError: 'list' object has no attribute 'lower'
I have created a process_textData function that takes in a pandas DataFrame column of text, then performs the following:
1. Convert text to lower case and remove all punctuation
2. Optionally apply ...
1
vote
2
answers
119
views
Issue while inserting count vectorizer results to the dataframe
I have a dataframe with shape (4237, 19) and another dataframe with shape (4237, 6). I need to combine both of these dataframes column-wise, so technically the resultant dataframe should be of the ...
0
votes
1
answer
1k
views
Transforming sentences to Numbers using SciKit-Learn’s CountVectorizer()
I am trying to convert an input sentence Review into a CountVectorizer. I am struggling to handle the sentences that are passed through. How do I deal with the sentences and add vectors to these? Any ...
0
votes
1
answer
247
views
CountVectorizer not able to detect words
final_vocab = {'Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra'}
vect = CountVectorizer(...
0
votes
1
answer
1k
views
fit CountVectorizer on training and test data to not miss any words
I have a training data set for which I know the labels for the classification and a test data set where I do not have the labels.
Now, I want to fit the vectorizer to the union of the training and ...
0
votes
1
answer
536
views
CountVectorizer() in Python is returning all zeros when I pass a custom vocabulary list
from sklearn.feature_extraction.text import CountVectorizer
foo = ["the Cat is :", "is smart now"]
cv = CountVectorizer(vocabulary = foo)
new_list = ["the Cat is : the most...
0
votes
0
answers
54
views
Encoding error with CSV file not allowing me to vectorize the data
I have a .csv file that has the format below:
I am using pandas to read it and then encoding it using utf-8, but it looks like pandas isn't splitting the columns "Sentence" and "Label...
0
votes
0
answers
254
views
scikit-learn CountVectorizer fit time taken or time estimate
I have around 100k texts in a numpy array.
Is it possible to see the time remaining to complete the fit
or at least see how much of the fit has completed
(like tqdm for for-loops)?
from sklearn....
2
votes
1
answer
1k
views
'CountVectorizer' object has no attribute 'toarray'
I'm trying to vectorize some tweets so I can put them in a list and use them in a classifier, but there is a problem turning the result into a DataFrame.
1
vote
2
answers
371
views
Python: ValueError on CountVectorizer. The truth value of a Series is ambiguous
I have this dataset and I'm trying to make a bag of words out of it using sklearn CountVectorizer, but it throws this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool()...
2
votes
1
answer
402
views
How to prioritize certain features with the max_features parameter in CountVectorizer
I have a working program, but I realized that some important n-grams in the test data were not part of the 6500 max_features I had allowed in the training data. Is it possible to add a feature like ...