
My dataframe looks like this:

ID topics   text
1     1        twitter is my favorite social media
2     1        favorite social media
3     2        rt twitter tomorrow
4     3        rt facebook today
5     3        rt twitter
6     4        vote for the best twitter
7     2        twitter tomorrow
8     4        best twitter

I want to group by topics and use CountVectorizer (I really prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. 3- and 4-grams) to compute the most frequent bigrams. After I get the most frequent bigram per topic, I want to create a new column called "bigram" and assign that topic's most frequent bigram to it.
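
For reference, a minimal sketch of that CountVectorizer setup; the multi-language stop word list here is an assumption built from NLTK's stopwords corpus, and ngram_range=(3, 4) covers the 3- and 4-gram case:

from nltk.corpus import stopwords  # assumes nltk.download('stopwords') has been run
from sklearn.feature_extraction.text import CountVectorizer

# Combine stop word lists for several languages (sklearn only ships 'english')
multi_stop = stopwords.words('english') + stopwords.words('spanish')
vect = CountVectorizer(stop_words=multi_stop, ngram_range=(3, 4))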

I want my output to look like this.

ID  topics  text                                 bigram
1   1       twitter is my favorite social media  favorite social
2   1       favorite social media                favorite social
3   2       rt twitter tomorrow                  twitter tomorrow
4   2       twitter tomorrow                     twitter tomorrow
5   3       rt twitter                           rt twitter
6   3       rt facebook today                    rt twitter
7   4       vote for the best twitter            best twitter
8   4       best twitter                         best twitter

Please note that the column 'topics' does NOT need to be in order by topics. I ordered it that way only for the sake of visualization when creating this post.

This code will be run on 6M rows of data, so it needs to be fast.

What is the best way to do it using pandas? I apologize if it seems too complicated.

  • Could you check your input and output? Where is "rt facebook today"? Is it "vote for the best twitter" or "vote for the best rt twitter"? Commented Jun 30, 2022 at 4:36

1 Answer


Update

You can use sklearn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
data = vect.fit_transform(df['text'])  # sparse document-term matrix of bigram counts

# Sum the bigram counts per topic, then take each topic's most frequent bigram
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').sum().idxmax(axis=1))
df['bigram'] = df['topics'].map(bigram)
print(df)

# Output
   ID  topics                                 text            bigram
0   1       1  twitter is my favorite social media   favorite social
1   2       1                favorite social media   favorite social
2   3       2                  rt twitter tomorrow  twitter tomorrow
3   4       3                    rt facebook today    facebook today
4   5       3                           rt twitter    facebook today
5   6       4            vote for the best twitter      best twitter
6   7       2                     twitter tomorrow  twitter tomorrow
7   8       4                         best twitter      best twitter
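
Note that .toarray() densifies the whole document-term matrix, which at 6M rows may not fit in memory. Here is a sketch that keeps everything sparse; the indicator-matrix trick is my own assumption, not part of the answer above:

import numpy as np
import scipy.sparse as sp

codes, topics = pd.factorize(df['topics'])
# Sparse (n_topics x n_rows) indicator: row t selects the rows belonging to topic t
indicator = sp.csr_matrix((np.ones(len(codes)), (codes, np.arange(len(codes)))))
per_topic = indicator @ data  # (n_topics x n_bigrams) counts, still sparse

best = np.asarray(per_topic.argmax(axis=1)).ravel()
bigram = pd.Series(vect.get_feature_names_out()[best], index=topics)
df['bigram'] = df['topics'].map(bigram)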

Update 2

How about if I want the 3 most frequent n-grams? What can I use instead of idxmax()?

# Sum counts within the topic, take the 3 largest, and return their bigram names
most_common3 = lambda x: x.sum().nlargest(3).index.to_frame(index=False).squeeze()
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').apply(most_common3)
            .rename(columns=lambda x: f"bigram{x+1}").reset_index())
df = df.merge(bigram, on='topics')
print(df)

# Output
   topics                                 text           bigram1       bigram2           bigram3
0       1  twitter is my favorite social media   favorite social  social media  twitter favorite
1       1                favorite social media   favorite social  social media  twitter favorite
2       2                  rt twitter tomorrow  twitter tomorrow    rt twitter      best twitter
3       2                     twitter tomorrow  twitter tomorrow    rt twitter      best twitter
4       3                    rt facebook today    facebook today   rt facebook        rt twitter
5       3                           rt twitter    facebook today   rt facebook        rt twitter
6       4            vote for the best twitter      best twitter     vote best    facebook today
7       4                         best twitter      best twitter     vote best    facebook today
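
One caveat visible in this output: nlargest(3) returns three column names even when some of the counts are 0 for that topic (e.g. 'best twitter' for topic 2). If that matters, a variant of the helper could drop zero-count bigrams and pad with None instead; this is my own sketch, not part of the answer:

def top3_nonzero(x):
    counts = x.sum()
    top = counts[counts > 0].nlargest(3).index.tolist()
    return pd.Series(top + [None] * (3 - len(top)))

It can be passed to .groupby('topics').apply(...) in place of most_common3 above.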

Old answer

You can use nltk:

import nltk

# Build a tuple of bigrams per row (tuples are hashable, so Series.mode works)
to_bigram = lambda x: tuple(nltk.bigrams(x.split()))
most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .groupby(level=0).apply(lambda x: x.mode()[0][0]))

df['bigram'] = df['topics'].map(most_common)
print(df)

# Output
   ID  topics                                 text              bigram
0   1       1  twitter is my favorite social media  (favorite, social)
1   2       1                favorite social media  (favorite, social)
2   3       2                  rt twitter tomorrow       (rt, twitter)
3   4       3                    rt facebook today      (rt, facebook)
4   5       3                           rt twitter      (rt, facebook)
5   6       4            vote for the best twitter     (best, twitter)
6   7       2                     twitter tomorrow       (rt, twitter)
7   8       4                         best twitter     (best, twitter)
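
Note that Series.mode here picks the most common list of bigrams per topic, and [0][0] then takes that list's first bigram, which is not necessarily the most frequent individual bigram. A sketch that counts the bigrams themselves with collections.Counter, as my own variant of the above:

from collections import Counter
from itertools import chain

most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .groupby(level=0)
                 .apply(lambda x: Counter(chain.from_iterable(x)).most_common(1)[0][0]))
df['bigram'] = df['topics'].map(most_common)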

4 Comments

Thank you very much for replying. I really prefer to use CountVectorizer, because it allows me to remove stop words and set the range of n-grams from 3 to 4. I really appreciate your quick response.
Yes, thank you very much again! I have another question, if it's not too much to ask... how about if I want the 3 most frequent n-grams? What can I use instead of idxmax()?
@CPDatascience I updated my answer according to your comment. Can you check it, please?
Glad to help you.
