
I am trying to lemmatize words in a particular column ('body') using pandas.

I have tried the following code, which I found here:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer 
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()

df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in 
df['body'].head()

When I attempt to run the code, I get an error message that simply says

File "<ipython-input-41-c002479904b0>", line 33
  df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
   ^
SyntaxError: invalid syntax

I have also tried the solution presented in this post but didn't have any luck.

UPDATE: this is the full code so far

import pandas as pd
import re
import string


df1 = pd.read_csv('RP_text_posts.csv')
df2 = pd.read_csv('RP_text_comments.csv')
# Renaming columns so the post part - currently 'selftext' matches the post variable in the comments - 'body'
df1.columns = ['author','subreddit','score','num_comments','retrieved_on','id','created_utc','body']
# Dropping columns that aren't subreddit or the post content
df1 = df1.drop(columns=['author','score','num_comments','retrieved_on','id','created_utc'])
df2 = df2.drop(labels=None, columns=['author', 'score', 'created_utc'])
# Combining data
df = pd.concat([df1, df2])

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')

# Lemmatizing
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x) 
df['body'].head()
  • Always share the entire error message, Commented Jan 26, 2020 at 20:05
  • Sorry, full error message is File "<ipython-input-41-c002479904b0>", line 33 df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x) ^ SyntaxError: invalid syntax Commented Jan 26, 2020 at 20:58
  • What code comes before that, is what you shared here everything there is? That doesn't look like it should throw a syntax error to me. Commented Jan 26, 2020 at 21:01
  • I added the full code so far, and corrected the column names. I think there's an option I might need to change after the lambda x:, but I'm not sure, and I didn't have any luck when I tested that by making my column header match the one specified in the example I was using, which had it labeled 'words'. Commented Jan 26, 2020 at 22:04

1 Answer


It's missing the end of the lambda function:

df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x])) 
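Note that `Word` here comes from TextBlob and is never imported in the question's snippet, and joining with `""` glues the tokens together. A minimal runnable sketch of the corrected pattern, using a hypothetical `lemmatize_stub` in place of `Word(word).lemmatize()` so it runs without TextBlob or NLTK data downloads:

```python
import pandas as pd

# Hypothetical stand-in for Word(word).lemmatize(), so this sketch runs
# without TextBlob/NLTK; it just lowercases each token.
def lemmatize_stub(word):
    return word.lower()

df = pd.DataFrame({"words": ["Best Scores", "Good CATS"]})

# Same shape as the corrected lambda: split into tokens, transform each
# token, then join with a space (not "") so the words stay separated.
df["words"] = df["words"].apply(
    lambda x: " ".join([lemmatize_stub(word) for word in x.split()])
)
print(df["words"].tolist())  # ['best scores', 'good cats']
```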

Update: The line should look more like the one below, but note that this way you can only lemmatize with one POS at a time (adjective, or verb, or ...):

df['words'] = df['body'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
print(df.head())
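If `word_tokenize`, the lemmatizer, or the stopword list raises a `LookupError`, the required NLTK data packages are missing. A one-time setup sketch (package names assume a recent NLTK release; each call is a no-op if the data is already present):

```python
import nltk

# One-time downloads of the data these examples rely on.
nltk.download('punkt')                        # tokenizer models for word_tokenize
nltk.download('wordnet')                      # WordNet data for the lemmatizer
nltk.download('stopwords')                    # stopword lists
nltk.download('averaged_perceptron_tagger')   # POS tagger used by nltk.pos_tag
```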

If you want to handle all parts of speech at once, you can try the following code:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')


def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)



# Lemmatizing
df['words'] = df['body'].apply(lambda x: lemmatize_sentence(x))
print(df.head())

df result:

            body                    |        words
0  Best scores, good cats, it rocks | Best score , good cat , it rock
1          You received best scores |          You receive best score
2                         Good news |                       Good news
3                          Bad news |                        Bad news
4                    I am loving it |                    I be love it
5                    it rocks a lot |                   it rock a lot
6     it is still good to do better |     it be still good to do good
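The Penn Treebank tags that `nltk.pos_tag` returns map onto WordNet's four POS constants by their first letter, which is all `nltk_tag_to_wordnet_tag` does. A standalone sketch of that mapping, with the WordNet constants replaced by the plain strings they evaluate to (`'a'`, `'v'`, `'n'`, `'r'`) so it runs without NLTK installed:

```python
# Same first-letter mapping as nltk_tag_to_wordnet_tag above, using the
# string values that wordnet.ADJ / VERB / NOUN / ADV stand for.
def penn_to_wordnet(penn_tag):
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(penn_tag[:1])

print(penn_to_wordnet("VBD"))  # v     (past-tense verb)
print(penn_to_wordnet("NNS"))  # n     (plural noun)
print(penn_to_wordnet("DT"))   # None  (determiner: no WordNet POS)
```

Tokens whose tag maps to `None` are appended unchanged, which is why "news" and "Bad" survive untouched in the table above.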

1 Comment

Ah sorry, that was my mistake when I copied the code. It doesn't work even with that correction.
