0 votes
0 answers
26 views

I have seen this configuration in the default RASA chatbot. Why do we need this setting for featurization, and for which specific languages? If I build a chatbot for non-English conversations, for example ...
Minh Khuất Đức's user avatar
1 vote
0 answers
50 views

I am using TfidfVectorizer as a text vectorizer, but I am experiencing a dimension mismatch when I try to compute cosine_similarity. My situation looks like this: firstly, def clean_text(text): return re....
KIZ-MAN's user avatar
  • 33
0 votes
1 answer
185 views

I have the following data for training a model to detect whether a sentence is (a) about a cat or dog, or (b) NOT about a cat or dog. I ran the following code to train a DecisionTreeClassifier() model then view ...
code_to_joy's user avatar
2 votes
1 answer
146 views

I am trying to create a document term matrix using CountVectorizer to extract bigrams and trigrams from a corpus. from sklearn.feature_extraction.text import CountVectorizer lemmatized = dat_clean['...
Kaitlin's user avatar
  • 83
0 votes
1 answer
145 views

I am quite new to deep learning but I am working on this little binary text classification experiment. I want to investigate what impact the training data size has on the metrics of the model (does a ...
clowny's user avatar
  • 1
1 vote
1 answer
307 views

I have a list of numbers and I want to use CountVectorizer from sklearn.feature_extraction.text import CountVectorizer def x(n): return str(n) sentences = [5,10,15,10,5,10] vectorizer = ...
saraafr's user avatar
  • 143
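One possible reading of the question above: the integers must become strings before vectorizing, and single-digit values such as 5 then disappear because the regular expression documented as CountVectorizer's default token_pattern, (?u)\b\w\w+\b, only matches tokens of two or more characters. The sketch below reproduces that tokenization with Python's re module alone; the relaxed pattern and the helper name tokenize are illustrative assumptions, not taken from the post.

```python
import re
from collections import Counter

# Regular expression documented as CountVectorizer's default token_pattern:
# it only matches tokens made of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Assumed alternative for illustration: also accept one-character tokens.
RELAXED_PATTERN = r"(?u)\b\w+\b"


def tokenize(docs, pattern):
    """Hypothetical helper: return every token that `pattern` finds in `docs`."""
    token_re = re.compile(pattern)
    return [tok for doc in docs for tok in token_re.findall(doc)]


numbers = [5, 10, 15, 10, 5, 10]
docs = [str(n) for n in numbers]  # a vectorizer expects strings, not ints

default_counts = Counter(tokenize(docs, DEFAULT_PATTERN))
relaxed_counts = Counter(tokenize(docs, RELAXED_PATTERN))

print(default_counts)  # "5" is missing: it is a single character
print(relaxed_counts)  # "5" is counted as well
```

If this reading is right, passing the documents as strings together with token_pattern=RELAXED_PATTERN to CountVectorizer should keep the single-digit values in the vocabulary.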
0 votes
2 answers
1k views

I'm trying to follow the example from the link below. https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe All the code up to this point ...
ASH's user avatar
  • 20.5k
0 votes
1 answer
178 views

In Python's CountVectorizer, I want Persian words that contain a half-space to be treated as one token, not two words. I would be grateful for any guidance. Thank you. I used "درخت‌های زیبا" in ...
Sedghian's user avatar
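A minimal sketch of why the split happens, assuming the half-space is the zero-width non-joiner (U+200C): Python's \w does not match that character, so the regular expression documented as CountVectorizer's default token_pattern sees a word boundary there. The alternative pattern below, which adds U+200C to the allowed characters, is an illustrative assumption rather than an established recipe.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, assumed to be the "half-space"

# Two-word sample with a ZWNJ inside the first word.
first_word = "درخت" + ZWNJ + "های"
text = first_word + " " + "زیبا"

# Regular expression documented as CountVectorizer's default token_pattern.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Assumed alternative: treat the ZWNJ as part of a token.
ZWNJ_PATTERN = r"(?u)[\w\u200c]+"

default_tokens = re.findall(DEFAULT_PATTERN, text)
joined_tokens = re.findall(ZWNJ_PATTERN, text)

print(len(default_tokens))  # the first word is split at the ZWNJ
print(len(joined_tokens))   # the first word stays in one piece
```

Under the same assumption, passing token_pattern=ZWNJ_PATTERN to CountVectorizer should keep such words as single features.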
2 votes
2 answers
161 views

I am trying to classify emotions from tweets with a dataset of 4401 tweets. When I use a smaller sample of the data (around 15 tweets), everything works fine, but when I use the full dataset it raises the error of ...
life_student's user avatar
1 vote
0 answers
98 views

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case: from sklearn.compose import ColumnTransformer from sklearn.feature_extraction.text import ...
Chukwudi's user avatar
  • 324
1 vote
0 answers
39 views

I have a list of keywords, for example: keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', ...] My goal is to define the parameter token_pattern of CountVectorizer by concatenating all keywords....
LJG's user avatar
  • 787
0 votes
1 answer
290 views

I am trying to make a CountVectorizer with a custom tokenizer function. I am facing a weird problem with it. In the code below, temp_tok is a list of 5 values which is used as the vocabulary later. temp_tok = [...
Anuj Chopra's user avatar
1 vote
0 answers
149 views

I have a dataset input, which is a list of ~40000 letters (that are represented as strings). With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation1: import numpy as np ...
TiMauzi's user avatar
  • 236
0 votes
1 answer
244 views

I am trying to build char level ngrams using sklearn's CountVectorizer. When using analyzer='char_wb' the vocab has features with whitespaces around it. I want to exclude the features/words with ...
Ankit Bansal's user avatar
1 vote
1 answer
84 views

I have a data frame with sentences and the respective part-of-speech tag for each word. Below is an extract of the data I'm working with (data taken from the SNLI corpus). For each sentence in my ...
OLGJ's user avatar
  • 472
0 votes
1 answer
664 views

After fitting with tfidf, I'm looking at the features that were generated: from sklearn.feature_extraction.text import TfidfVectorizer corpus = [ 'This is the first document.', 'This document ...
james pow's user avatar
  • 356
0 votes
1 answer
127 views

I am trying to do grid search over a sklearn pipeline that uses a custom transformer in a pipeline with FeatureUnion. It works fine when the pipeline uses the custom transformer class in FeatureUnion; ...
MichaelU's user avatar
  • 125
1 vote
0 answers
118 views

I'm having a problem with this CV code. It was working perfectly 4-5 months ago. However, I'm getting an error now: Unresolved attribute reference 'transform' for class 'object'. Does anyone have ...
Jeff's user avatar
  • 11
1 vote
0 answers
211 views

I am using LinearSVC for prediction. I want to use two columns to predict the class of an item. I have written code that uses only one column; how do I include two columns? labels_T = df_T['...
FRECEENA FRANCIS's user avatar
2 votes
0 answers
736 views

I've been trying this for a while, but still couldn't figure out the solution. I have a pipeline with a few steps and a LGBM classifier, which I want to use with early_stopping_round parameter. ...
dsbr__0's user avatar
  • 291
1 vote
0 answers
41 views

Why do I have to apply CountVectorizer to a smaller sample and then make the data frame? Why can't I apply a count vectorizer to a large sample and create a data frame out of it? Here is my code: ...
MUNIM BIN MUQUITH's user avatar
1 vote
0 answers
147 views

I'm trying to perform a count vectorization using a function I've created. However, I keep getting an error stating "column not iterable", and I cannot figure out why or how to ...
anon_e's user avatar
  • 21
1 vote
1 answer
238 views

What I am trying to do is basically pulling out keywords from a processed file of a log file and creating a vectorized dataframe of those keywords. But when I am writing that dataframe into CSV, words ...
Ujjawal Pandey's user avatar
2 votes
1 answer
1k views

My dataframe looks like this:
ID  topics  text
1   1       twitter is my favorite social media
2   1       favorite social media
3   2       rt twitter tomorrow
4   3       rt facebook ...
CPDatascience's user avatar
0 votes
1 answer
312 views

I am working on SMS data where one column of my dataframe holds a list of words. I want to train a classifier to predict its type and subtype. How would I convert the words into numerical format ...
aashish's user avatar
  • 11
1 vote
1 answer
467 views

I want to extract keywords using pyspark.ml.feature.CountVectorizer. My input Spark dataframe looks as follows: id text 1 sun, mars, solar system, solar system, mars, solar system, venus, solar ...
red_quark's user avatar
  • 1,001
0 votes
1 answer
56 views

Here I am using CountVectorizer on some text. The result is one where the counts don't match the words. For example, at index 0, "rock" should have a count of 3, but instead it shows 2, and "here" ...
Omar Naguib's user avatar
2 votes
0 answers
376 views

I have a problem concerning the TfidfVectorizer. I have 3 columns: one is the text that needs to be vectorized and the two others are already numbers, so I only need to vectorize ...
Christian Holm's user avatar
3 votes
1 answer
541 views

I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer method). I have a data frame with a column of text with words like 'acid', 'acidic', 'acidity', '...
Rebecca James's user avatar
0 votes
2 answers
424 views

I encounter a problem with numpy arrays. I used CountVectorizer from sklearn with a wordset and values (from pandas column) to create an array of arrays that count words (BoW). And when I print the ...
Alexandre Juan's user avatar
0 votes
1 answer
794 views

I'm wondering if, when I use CountVectorizer().fit_transform(), the output preserves the order of the input. My input is a list of documents. I know that the output matches the input in terms of the ...
rookinn's user avatar
1 vote
1 answer
102 views

I took this example from the SKLearn website. Here's the initial code: from sklearn.feature_extraction.text import CountVectorizer corpus = ['This is the first document.', 'This document is ...
Felipe Queiroz's user avatar
0 votes
2 answers
975 views

What I want to do is create a bag of words for 11410 strings and then append at the end of the word columns the result that I have stored in another dataframe. I have a dataframe with the column '...
user1512676872's user avatar
1 vote
2 answers
1k views

I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack ...
DJL's user avatar
  • 145
0 votes
1 answer
300 views

I noticed that, unlike in scikit-learn, the PySpark implementation for CountVectorizer uses the socket library and so I'm unable to pickle it. Is there any way around this or another way to persist ...
pear's user avatar
  • 131
-1 votes
1 answer
3k views

This is how I am converting text to a count vector: cv1 = CountVectorizer() x_traincv=cv1.fit_transform(x_train) a = x_traincv.toarray() a. This is the model I am using for prediction: from sklearn.ensemble import ...
Pratik barahatte's user avatar
0 votes
1 answer
78 views

Original problem - context: NLP - from a list of n strings, choose all the strings which don't have common words (without considering the words in a pre-defined list of stop words) Approach that I ...
Abdur Rahman's user avatar
1 vote
1 answer
1k views

I would like to print out the list of words (i.e., bag of words) for each document in a corpus and their respective term frequency (in text format), using Sklearn's CountVectorizer. How could I ...
FAISAL BARGI's user avatar
1 vote
2 answers
4k views

I am trying to deploy my machine learning Naive Bayes sentiment analysis model onto a web application. The idea is that the user should type some text, which the application performs sentiment ...
LilRi's user avatar
  • 69
0 votes
1 answer
401 views

I have created process_textData function that takes in a pandas DataFrame column of text, then performs the following: 1. Convert text to lower case and remove all punctuation 2. Optionally apply ...
Yordan Иванов's user avatar
1 vote
2 answers
119 views

I have a dataframe with shape (4237, 19) and then other dataframe with the shape (4237, 6), I need to combine both these dataframes column wise, so technically resultant dataframe should be of the ...
Satyam Anand's user avatar
0 votes
1 answer
1k views

I am trying to convert an input sentence Review into a CountVectorizer. I am struggling to handle the sentences that are passed through. How do I deal with the sentences and add vectors to these? Any ...
N K's user avatar
  • 167
0 votes
1 answer
247 views

final_vocab = {'Amazon', 'Big Bazaar', 'Brand Factory', 'Central', 'Cleartrip', 'Dominos', 'Flipkart', 'IRCTC', 'Lenskart', 'Lifestyle', 'MAX', 'MMT', 'More', 'Myntra'} vect = CountVectorizer(...
qaiser's user avatar
  • 2,908
0 votes
1 answer
1k views

I have a training data set for which I know the labels for the classification and a test data set where I do not have the labels. Now, I want to fit the Vectorizer to the union of the training and ...
ghxk's user avatar
  • 33
0 votes
1 answer
536 views

from sklearn.feature_extraction.text import CountVectorizer foo = ["the Cat is :", "is smart now"] cv = CountVectorizer(vocabulary = foo) new_list =["the Cat is : the most" ...
Raed's user avatar
  • 31
0 votes
0 answers
54 views

I have a .csv file that has the format below: I am using pandas to read it and then encoding it using utf-8 but it looks like pandas isn't splitting the columns "Sentence" and "Label" ...
need_help12's user avatar
0 votes
0 answers
254 views

I have around 100k texts in a numpy array. Is it possible to see the time remaining to complete the fit or at least see the amount of fit that has happened (like tqdm for for-loops)? from sklearn....
Anirudh's user avatar
  • 25
2 votes
1 answer
1k views

I'm trying to vectorize some tweets so I can put them in a list and use them in a classifier. But there is a problem turning the result into a DataFrame.
bo_'s user avatar
  • 81
1 vote
2 answers
371 views

I have this dataset and I'm trying to make Bag of Words out of it using sklearn CountVectorizer, but it throws me this error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool()...
Abbi KRK's user avatar
2 votes
1 answer
402 views

I have a working program but I realized that some important n-grams in the test data were not a part of the 6500 max_features I had allowed in the training data. Is it possible to add a feature like " ...
user16895885's user avatar
