Pruning a Tab Delimited File with Python


So I have a bunch of tab-delimited data files like:

    subject phase   condition   trial   trial type  target loc  targetid    distid  digit1  digit2  accuracy-t  rt-p    rt-t
    2   1   9   1   cong    bottom  s h f t   s h f t   7   2   1   742.69104   681.4379692
    2   1   9   2   cong    top p s t e   p s t e   2   3   1   699.4130611 454.8609257
    2   1   9   3   incong  top s u g r   y o u t h   6   5   1   979.2759418 31.06093407
    2   1   9   4   incong  top c h e e k   g r o n   4   8   1   1025.339842 31.55088425
    2   1   9   5   incong  bottom  s t l k   l e v e   7   9   1   555.9248924 479.6338081
    2   1   9   6   incong  top b r n   f e l d   4   5   2   976.7041206 31.50486946
    2   1   9   7   incong  bottom  c r o w n   p l t e   5   7   1   0   32.24992752
    2   1   9   8   cong    top s t n d   s t n d   7   6   1   1092.888117 31.59618378
    2   1   9   9   cong    bottom  r o u t e   r o u t e   4   8   1   883.2840919 31.32796288
    2   1   9   10  cong    top f l o t   f l o t   5   6   1   768.682003

What I want is to strip out of the file any lines with a value of '2' or '3' under the 'accuracy-t' heading (sorry they're misaligned; it's the 10th value).
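For reference, that pruning rule can be sketched as a simple filter that builds a new list of kept rows rather than deleting from the list being read (the helper name and the column index 10 are my assumptions, matching the description above):

```python
# Sketch only: keep rows whose 'accuracy-t' field (assumed index 10)
# is not '2' or '3'; header rows and short rows pass through untouched.
def prune_lines(lines, bad_values=('2', '3'), col=10):
    kept = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > col and fields[col] in bad_values:
            continue  # drop flagged data rows
        kept.append(line)
    return kept

rows = [
    'subject\tphase\tcondition\ttrial\ttrial type\ttarget loc\t'
    'targetid\tdistid\tdigit1\tdigit2\taccuracy-t\trt-p\trt-t',
    '2\t1\t9\t1\tcong\tbottom\ts h f t\ts h f t\t7\t2\t1\t742.69104\t681.4379692',
    '2\t1\t9\t6\tincong\ttop\tb r n\tf e l d\t4\t5\t2\t976.7041206\t31.50486946',
]
print(prune_lines(rows))  # the row with accuracy-t == '2' is gone
```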

So the basic idea is a Python script that iterates a function over multiple files (seen here as 'studyfile') and spits out a new tab-delimited text file with the items removed (seen here as 'goodstudyfile'). I came up with this:

    groupvar=['1','2']
    subjectvar=['1','2']
    condvar=['1','2','3','4','5','6','7','8','9','10','11','12']

    for group in groupvar:
        for subject in subjectvar:
            for condition in condvar:
                studyfile_name = '*/pruning/study 126/group_'+str(group)+'_subject_'+str(subject)+'_condition_'+str(condition)+'_phase_1.txt'
                studyfile = open(studyfile_name,'r')

                goodstudyfile_name = '*/pruning/study 126/phase 1 no errors/group_'+str(group)+'_subject_'+str(subject)+'_condition_'+str(condition)+'_phase_1_fixed.txt'
                goodstudyfile = open(goodstudyfile_name,'w')

                study_lines = studyfile.readlines()
                studyfile.close()

                first_block = study_lines[4].split('\t')[1].strip()

                nr_errors_removed = 0
                r_errors_removed = 0
                spoils_removed = 0
                low_cutoff_spoils = 0
                for study_line in study_lines:
                    if len(study_line.split('\t')) > 2:
                        if study_line.split('\t')[10] == '2':
                            if study_line.split('\t')[4] == 'incong':
                                study_lines.remove(study_line)
                                nr_errors_removed+=1
                            elif study_line.split('\t')[4] == 'cong':
                                study_lines.remove(study_line)
                                r_errors_removed+=1
                        elif study_line.split('\t')[10] == '3':
                            study_lines.remove(study_line)
                            spoils_removed+=1
                        else:
                            for study_line in study_lines[1:]:
                                if int(float(study_line.split('\t')[12][:8])) < 100.00:
                                    study_lines.remove(study_line)
                                    low_cutoff_spoils+=1
                print 'group:' + str(group) + ' subject:' + str(subject) + ' condition:' + str(condition)
                print 'nr errors:'+ str(nr_errors_removed)
                print 'r errors:'+ str(r_errors_removed)
                print 'spoils:'+ str(spoils_removed)
                print 'low cutoff spoils:'+ str(low_cutoff_spoils)
                goodstudyfile.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(nr_errors_removed, 'nr errors removed', r_errors_removed, 'r errors removed', spoils_removed, 'spoils removed', low_cutoff_spoils, 'low cutoff spoils'))
                goodstudyfile.write('{}\n'.format(first_block))
                for line in study_lines:
                    goodstudyfile.write(line)
                goodstudyfile.close()

So it iterates fine across all of the files (48 files, based on the possible permutations of the group, subject, and condvar combinations), but for some reason it regularly misses lines that should be deleted. In the supposedly 'fixed' files, I'll still have a bunch of lines that ought to have been removed.

Nothing I try seems to fix or change the outcome, and the missed lines are consistent (i.e. it misses line 7 of group2_subject1_condition_6 despite line 7 being tagged '2'). Can anyone tell me where I'm going wrong?
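For what it's worth, a classic cause of consistently skipped lines is calling `remove()` on a list while iterating over that same list: each removal shifts the remaining elements left, so the loop steps past the element that followed the removed one. A standalone toy demonstration (not the study files):

```python
# Removing items from a list while iterating over it skips the element
# that follows each removal, so some '2's are never visited.
nums = ['2', '2', '1', '2', '2', '1']
for n in nums:
    if n == '2':
        nums.remove(n)
print(nums)  # ['1', '2', '2', '1'] -- two '2's survive
```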

And here's an example of one of the lines it's missing:

    subject phase   condition   trial   trial type  target loc  targetid    distid  digit1  digit2  accuracy-t  rt-p    rt-t
    1   1   6   25  incong  top v l u e   g u d e   9   7   2   304.780960083   866.713047028

which should have been pruned by the Python script, since it has a value of '2' under accuracy-t.
