My dataframe looks like this:
ID  topics  text
1   1       twitter is my favorite social media
2   1       favorite social media
3   2       rt twitter tomorrow
4   3       rt facebook today
5   3       rt twitter
6   4       vote for the best twitter
7   2       twitter tomorrow
8   4       best twitter
I want to group by topics and use CountVectorizer to compute the most frequent bigram in each topic. (I strongly prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. 3- or 4-grams.) Once I have the most frequent bigram, I want to create a new column called "biagram" and assign each topic's most frequent bigram to that column.
I want my output to look like this.
ID  topics  text                                 biagram
1   1       twitter is my favorite social media  favorite social
2   1       favorite social media                favorite social
3   2       rt twitter tomorrow                  twitter tomorrow
4   2       twitter tomorrow                     twitter tomorrow
5   3       rt twitter                           rt twitter
6   3       rt facebook today                    rt twitter
7   4       vote for the best twitter            best twitter
8   4       best twitter                         best twitter
Please note that the 'topics' column does NOT need to be sorted. I sorted the rows by topic only to make this post easier to read.
This code will be run on 6M rows of data, so it needs to be fast.
What is the best way to do it using pandas? I apologize if it seems too complicated.