python - Extracting sequence and header from fasta file by known sequences -

June 15, 2015

i trying compare 2 files , extract sequences have subset of others. and, want extract identifiers too. however, can being able extract sequences including subsets. example files :

text.fa >header1 ettthaascisattvqeq*tlfrllp >header2 skspcsdsdy**aaa >header3 ssgavaaaptta

and,

textref.fa >textref.fa cisa aaap aatp

when run code, having output :

ettthaascisattvqeq*tlfrllp ssgavaaaptta

however, expected output headers:

>header1 ettthaascisattvqeq*tlfrllp >header3 ssgavaaaptta

my code in 2 parts, first create file these sequences, , try extract them headers original fasta file :

def get_nucl(filename):     open(filename,'r') fd:         nucl = []         line in fd:             if line[0]!='>':                 nucl.append(line.strip())         return nucl def finding(filename,reffile):         nucl = get_nucl(filename)         open(reffile,'r') reffile2:             line in reffile2:                 element in nucl:                     if line.strip() in element:                             yield(element)        open('sequencesmatched.txt','w') output:             results = finding('text.fa','textref.fa',)             res in results:                 print(res)                 output.write(res + '\n')

so, in sequencesmatched.txt, having sequences of text.fa have substrings of textref.fa. :

ettthaascisattvqeq*tlfrllp ssgavaaaptta

so in other part, retrieve respective headers , these sequences :

    def finding(filename,seqfile):         open(filename,'r') fastafile:                 open(seqfile,'r') sequf:                         alls=[]                         line in fastafile:                                 alls.append(line.strip())                         print(alls)                         sequfs = []                         line2 in sequf:                                 sequfs.append(line2.strip())                                 if str(line.strip()) == str(line2.strip()):                                         num = alls.index(line.strip())                                         print(alls[num-1] + line)   print(finding('text.fa','sequencesmatched.txt'))

however, output, can retrieve 1 sequence,which first match:

>header1 ettthaascisattvqeq*tlfrllp

maybe without second file, not make right loops sequences , respective headers. therefore, went long way..

i happy if can help!

you can easier if file same structure:

def get_nucl(filename):     open(filename, 'r') fd:         headers = {}         key = ''         line in fd.readlines():                 if '>' in line:                 key = line.strip()[1:] # remove '>'             else:                 headers[key] = line.strip()      return headers

here i'm assuming file begin ">headern" whatever, if not have add test. have dictionnary headers['header1'] = 'ettthaascisattvqeq*tlfrllp'.

so find matches use dict :

def finding(filename, reffile):     headers = get_nucl(filename)     open(reffile, 'r') f:         matches = {}         line in f.readlines():             key, value in headers.items():                 if line.stip() in value , key not in matches:                     matches[key] = value      return matches

so have dict headers matching values, can check in dict if have sub string, , have header value key.

just saw did print(finding(....) , function print, call it.

Search This Blog

Enable

python - Extracting sequence and header from fasta file by known sequences -

Comments

Post a Comment

Popular posts from this blog

Sort a complex associative array in PHP -

vb.net - How to ignore if a cell is empty nothing -

recursion - Can every recursive algorithm be improved with dynamic programming? -