I am trying to pull keywords out of a processed log file and build a vectorized dataframe of those keywords. When I write that dataframe to CSV, the words end up as column headers with their counts in the second row. I want the words in rows instead, with each word's count in a second column.

trial.py :

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def removeNumbers(list):
   # doing something (body omitted)
   pass

def processFiles(filename):
   # doing something (body omitted)
   pass

def readFile(fileName):
   # doing something (body omitted)
   pass

# Build our text
processFiles("log.txt")
text = readFile("processedFile.txt")


vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())



counts.to_csv("keywords_count.csv")

keywords_count.csv looks like this :

,accept,accepted,action,add,address,agent,allocated,api,api_action_sender,api_reader,apihandle,apiinitialize,apiterminate,appl,associate,attempt,available,bd,bdfb,broken,ceased,check_signals,chose,cksm,cl,clcat,client,close,code,complete,conf,configuration,connection,connfd,constructing,control,creating,ctcd,delresp,dereg,deregistering,does,dreg_process,dst,dump,edci,engine,entering,entity,entity_initialize,entries,entry,event,event_establishsessionsend,event_timert_expire,exist,exists,exit,exiting,expect,expired,failed,fc,file,filter,flg,flow,flow_timer_start,flow_timer_stop,forward,gateway,handle,home,hop,if,ifaeddrg_byaddr,ifidx,image,images,index,inf,info,informational,init_policyapi,initialization,initialized,install,interface,ioctl,ip,len,level,lih,link,list,local,locate_configfile,log,loopback,mailbox,mailbox_register,mailslot,mailslot_create,mailslot_send,mailslot_sitter,main,mcast_add,module,msg,necessary,new,node,obj,old,open_socket,operation,os,outgoing,papi_debug,papilogfunc,papiuservalue,path,pathdelta,pathed,pathtear,pipe,policy,process,proterr,proto,qoshandle,qoshd,qosmgr,qosmgr_request,qosmgr_response,query,querying,rapi,raw,rc,read_physical_netif,readbuffer,ready,reason,received,reentering,reg_process,registered,registering,registerwithpolicyapi,registration,remove,req,request,reservation,response,result,resv,resvdelta,resved,resvresp,return,returned,route,router_forward_getoi,rpapi_getpolicydata,rpapi_getspecdata,rpapi_reg_unregflow,rsv,rsvp,rsvp_action_nhop,rsvp_api_open,rsvp_event,rsvp_event_establishsession,rsvp_event_mapsession,rsvp_event_propagate,rsvp_explode_packet,rsvp_flow_statemachine,rsvp_hop,rsvp_parse_objects,rsvpd,rsvpfindactionname,rsvpfindservicedetailsonactname,rsvpgettspec,rsvpputactionname,rsvpremactionname,rthdl,send,sender,sender_withdraw,sending,service,sess,session,sessioned,setsockopt,settcpimage,sigalrm,signal,sigterm,socket,source,specified,src,start,started,state,status,stop,stopped,style,successful,supported,tc,tcp,tcpcs
,term,term_policyapi,terminate,terminated,terminator,timer,tout,tr,trace,traffic,traffic_action_oif,traffic_reader,ttl,type,udp,unregistered,unregisterfrompolicyapi,user,using,vlink,warning,wf,writing
0,1,1,1,1,18,1,28,8,1,6,1,3,2,1,1,2,4,2,1,1,1,1,1,4,1,3,1,1,1,1,1,1,2,1,9,2,22,2,1,1,1,2,3,3,2,5,2,20,7,7,1,7,31,1,6,1,6,1,17,1,6,4,8,1,2,4,4,12,7,2,7,7,1,4,1,2,7,1,1,7,7,147,2,14,1,8,1,18,9,5,4,1,4,2,1,1,1,1,1,24,23,20,27,9,7,3,4,1,2,2,2,1,4,1,2,1,1,1,3,1,1,7,1,2,4,2,2,10,1,3,2,1,2,4,4,6,1,1,4,4,8,12,1,2,12,9,3,1,1,3,2,2,1,4,3,2,6,4,1,20,1,1,1,17,35,11,3,12,4,38,8,1,4,1,7,1,4,26,4,8,2,3,3,3,3,3,1,1,1,1,9,3,3,10,4,4,2,6,8,1,6,12,1,3,4,9,26,2,5,2,4,10,1,2,2,1,1,8,2,2,1,2,6,1,119,2,2,3,4,5,14,1,3,1,1,1,4,4,1

1 Answer

Transpose your dataframe:

counts.T.to_csv("keywords_count.csv")
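
As a rough illustration of what the transpose does to the CSV layout, here is a minimal sketch. The three-word `counts` frame is a hypothetical stand-in for the one-row frame built from the CountVectorizer output in the question, and the CSV is written to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the one-row frame from the question:
# one column per word, a single row (index 0) holding the counts.
counts = pd.DataFrame([[2, 1, 3]], columns=["accept", "action", "address"])

# Transposing swaps rows and columns, so each word becomes a row label.
buffer = io.StringIO()
counts.T.to_csv(buffer)
csv_text = buffer.getvalue()

print(csv_text)
```

After the transpose, each word sits on its own line followed by its count. The header line is `,0` because the word index has no name and the single column keeps the original row label `0`.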

5 Comments

Transposing does solve the problem, but the CSV still shows a single column with each word and its count right next to it. How do I split the words and counts into two different columns?
What do you mean? Your CSV file already has 2 columns, in the form "word,count" (comma separated). Do you want to change the separator between the word and count columns? Try counts.T.to_csv("keywords_count.csv", sep='\t')
Yes, there are 2 columns, but one is empty and both the word and its count are in the second column, separated by \t. I want the word in one column and the count in the second.
Try counts = pd.Series(matrix.toarray()[0], index=vectorizer.get_feature_names_out(), name='count').rename_axis('word').reset_index(). Then counts.to_csv('keywords_count.csv')
Yes, now the CSV is in the format I wanted. Thanks!
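
The Series-based suggestion from the comments can be sketched as follows. The `row` and `words` arrays are hypothetical stand-ins for `matrix.toarray()[0]` and `vectorizer.get_feature_names_out()`. Passing `index=False` to `to_csv` is an assumption added here, not part of the original suggestion: without it, pandas also writes the 0, 1, 2, ... row index as an extra leading column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for matrix.toarray()[0] and
# vectorizer.get_feature_names_out() from the question.
row = np.array([2, 1, 3])
words = np.array(["accept", "action", "address"])

# Build a Series of counts indexed by word, name the index "word",
# then turn the index into a regular column.
counts = (
    pd.Series(row, index=words, name="count")
    .rename_axis("word")
    .reset_index()
)

# index=False (an addition to the suggestion) skips the numeric row index.
buffer = io.StringIO()
counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

This gives a frame with two named columns, `word` and `count`, so the CSV starts with a `word,count` header followed by one word per line.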