How to create row wise CSV for vectorized dataframe?

Question

I am extracting keywords from a processed log file and building a vectorized dataframe of those keywords. When I write that dataframe to CSV, the words end up as the column headers and their counts in the second row. I want the words in rows instead, with each word's count in a second column.

trial.py :

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def removeNumbers(list):
   ...  # doing something

def processFiles(filename):
   ...  # doing something

def readFile(fileName):
   ...  # doing something

# Build our text
processFiles("log.txt")
text = readFile("processedFile.txt")


vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())



counts.to_csv("keywords_count.csv")

keywords_count.csv looks like this :

,accept,accepted,action,add,address,agent,allocated,api,api_action_sender,api_reader,apihandle,apiinitialize,apiterminate,appl,associate,attempt,available,bd,bdfb,broken,ceased,check_signals,chose,cksm,cl,clcat,client,close,code,complete,conf,configuration,connection,connfd,constructing,control,creating,ctcd,delresp,dereg,deregistering,does,dreg_process,dst,dump,edci,engine,entering,entity,entity_initialize,entries,entry,event,event_establishsessionsend,event_timert_expire,exist,exists,exit,exiting,expect,expired,failed,fc,file,filter,flg,flow,flow_timer_start,flow_timer_stop,forward,gateway,handle,home,hop,if,ifaeddrg_byaddr,ifidx,image,images,index,inf,info,informational,init_policyapi,initialization,initialized,install,interface,ioctl,ip,len,level,lih,link,list,local,locate_configfile,log,loopback,mailbox,mailbox_register,mailslot,mailslot_create,mailslot_send,mailslot_sitter,main,mcast_add,module,msg,necessary,new,node,obj,old,open_socket,operation,os,outgoing,papi_debug,papilogfunc,papiuservalue,path,pathdelta,pathed,pathtear,pipe,policy,process,proterr,proto,qoshandle,qoshd,qosmgr,qosmgr_request,qosmgr_response,query,querying,rapi,raw,rc,read_physical_netif,readbuffer,ready,reason,received,reentering,reg_process,registered,registering,registerwithpolicyapi,registration,remove,req,request,reservation,response,result,resv,resvdelta,resved,resvresp,return,returned,route,router_forward_getoi,rpapi_getpolicydata,rpapi_getspecdata,rpapi_reg_unregflow,rsv,rsvp,rsvp_action_nhop,rsvp_api_open,rsvp_event,rsvp_event_establishsession,rsvp_event_mapsession,rsvp_event_propagate,rsvp_explode_packet,rsvp_flow_statemachine,rsvp_hop,rsvp_parse_objects,rsvpd,rsvpfindactionname,rsvpfindservicedetailsonactname,rsvpgettspec,rsvpputactionname,rsvpremactionname,rthdl,send,sender,sender_withdraw,sending,service,sess,session,sessioned,setsockopt,settcpimage,sigalrm,signal,sigterm,socket,source,specified,src,start,started,state,status,stop,stopped,style,successful,supported,tc,tcp,tcpcs
,term,term_policyapi,terminate,terminated,terminator,timer,tout,tr,trace,traffic,traffic_action_oif,traffic_reader,ttl,type,udp,unregistered,unregisterfrompolicyapi,user,using,vlink,warning,wf,writing
0,1,1,1,1,18,1,28,8,1,6,1,3,2,1,1,2,4,2,1,1,1,1,1,4,1,3,1,1,1,1,1,1,2,1,9,2,22,2,1,1,1,2,3,3,2,5,2,20,7,7,1,7,31,1,6,1,6,1,17,1,6,4,8,1,2,4,4,12,7,2,7,7,1,4,1,2,7,1,1,7,7,147,2,14,1,8,1,18,9,5,4,1,4,2,1,1,1,1,1,24,23,20,27,9,7,3,4,1,2,2,2,1,4,1,2,1,1,1,3,1,1,7,1,2,4,2,2,10,1,3,2,1,2,4,4,6,1,1,4,4,8,12,1,2,12,9,3,1,1,3,2,2,1,4,3,2,6,4,1,20,1,1,1,17,35,11,3,12,4,38,8,1,4,1,7,1,4,26,4,8,2,3,3,3,3,3,1,1,1,1,9,3,3,10,4,4,2,6,8,1,6,12,1,3,4,9,26,2,5,2,4,10,1,2,2,1,1,8,2,2,1,2,6,1,119,2,2,3,4,5,14,1,3,1,1,1,4,4,1

Corralien · Accepted Answer · 2022-07-10 12:18:18Z

Transpose your dataframe:

counts.T.to_csv("keywords_count.csv")
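As a minimal sketch of why the transpose helps, the snippet below uses a small hand-built frame in place of the real CountVectorizer output. The three words and their counts are stand-ins, and the CSV text goes to an in-memory buffer rather than keywords_count.csv. After the transpose each word becomes a row label, and its count sits in a column labelled 0.

```python
import io

import pandas as pd

# Hypothetical stand-in for the one-row frame built from matrix.toarray()
counts = pd.DataFrame([[1, 18, 28]], columns=["accept", "address", "allocated"])

# Without the transpose: words form the header row, counts the single data row
wide = io.StringIO()
counts.to_csv(wide)

# With the transpose: one row per word, count in the column labelled 0
tall = io.StringIO()
counts.T.to_csv(tall)

wide_csv = wide.getvalue()
tall_csv = tall.getvalue()
print(wide_csv)
print(tall_csv)
```

The first line of the transposed output is ",0" because the word column has no name yet and the count column keeps the original row label 0.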


5 Comments

Ujjawal Pandey Over a year ago

Transposing does solve the problem. But the CSV still has a single column, with each word and its count right next to each other. How do I split the words and counts into two separate columns?

Corralien Over a year ago

What do you mean? There are 2 columns in your csv file like "word,count" (comma separated). Do you want to change the separator between word and count columns? Try counts.T.to_csv("keywords_count.csv", sep='\t')
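One way to check how many fields each row really contains, independent of how a spreadsheet program displays the file, is to parse the text back with the same delimiter. This is only a sketch: the small frame is a made-up stand-in for the real counts, and the output goes to an in-memory buffer instead of keywords_count.csv.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the one-row counts frame from the question
counts = pd.DataFrame([[1, 18, 28]], columns=["accept", "address", "allocated"])

# Write the transposed frame with a tab separator, as suggested above
buffer = io.StringIO()
counts.T.to_csv(buffer, sep="\t")

# Parse the text back with the same delimiter and inspect the fields per row
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
print(rows)
```

Every parsed row has two fields, word and count. If a viewer shows them in one cell, the likely cause is that it is splitting on a different delimiter than the one used when writing.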

Ujjawal Pandey Over a year ago

Yeah, there are 2 columns, but one is empty and both the word and its count are in the second column, separated by \t. See image. I want the word in one column and the count in the second.

Corralien Over a year ago

Try:

counts = pd.Series(matrix.toarray()[0], index=vectorizer.get_feature_names_out(), name='count').rename_axis('word').reset_index()

Then counts.to_csv('keywords_count.csv')
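A minimal sketch of that approach, with plain lists standing in for matrix.toarray() and vectorizer.get_feature_names_out() (the words and counts are made up). Passing index=False is an addition not in the comment above: without it, to_csv also writes the 0, 1, 2, ... row numbers as an extra leading column.

```python
import io

import pandas as pd

# Hypothetical stand-ins for matrix.toarray() and get_feature_names_out()
dense = [[1, 18, 28]]
words = ["accept", "address", "allocated"]

# Series indexed by word, then turned into a two-column frame: word, count
counts = (
    pd.Series(dense[0], index=words, name="count")
    .rename_axis("word")
    .reset_index()
)

buffer = io.StringIO()
# index=False (an assumption about the desired output) drops the row numbers
counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The result has a proper "word,count" header followed by one row per keyword.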

Ujjawal Pandey Over a year ago

Yeah, now the CSV is in the format I wanted. Thanks, man!
