r - Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring -
I am particularly looking at R, Perl, and shell, but any other programming language would be fine too.
Question
Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.
https://regex101.com does visualize how a string matches a regular expression, but it is far from perfect and is not efficient for my huge dataset.
Problem
I have around 12000 matched strings (DNA sequences) from my first regex. I want to process these strings and, based on some strict rules, find other strings in a second file that go together with those 12000 matches.
Simplified example
This is my first regex (a simplified, shorter version of my original regex), which runs through my first text file.
[acgt]{1,12000}(aac)[ag]{2,5}[acgt]{2,5}(ctgtgta)
Let's suppose that it finds the following three substrings in my large text file:
1. aaacccgtgtaataacagacgtactgtgta
2. tttttttgcgaccgagaaacggttctgtgta
3. taacaaggaccctgtgta
Now I have a second file that contains a large string. From this second file, I am only interested in extracting the substrings that match a new (second) regex, which itself depends on my first regex in a few sections. Therefore, the second regex has to take into account the substrings matched in the first file and look at how they have matched the first regex!
Allow me, for the sake of simplicity, to index my first regex for better illustration in this way:
first.regex.p1 = [acgt]{1,12000}
first.regex.p2 = (aac)
first.regex.p3 = [ag]{2,5}
first.regex.p4 = [acgt]{2,5}
first.regex.p5 = (ctgtgta)
Now my second (new) regex, which will search the second text file and depends on the results of the first regex (and on how the substrings returned from the first file have matched the first regex), is defined in the following way:
second.regex = (ctaaa)[ac]{5,100}(tttggg){**rule1**} (ctt)[ag]{10,5000}{**rule2**}
Here, rule1 and rule2 depend on the matches coming from the first regex on the first file. Hence:
rule1 = look at the matched strings from file1 and take the complement of the part that matched first.regex.p3 in each matched substring (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and take the complement of the part that matched first.regex.p4 in each matched substring (the complement should of course have the same length)
You can see that the second regex has sections that belong to itself (i.e. they are independent of any other file or regex), but it also has sections that depend on the results of the first file, on the rules of the first regex, and on how each substring in the first file has matched that first regex!
Now again, for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show what a possible match from the second file looks like and how it satisfies the second regex:
This is what we had from our first regex run through the first file:
3. taacaaggaccctgtgta
So in this match, we see that:
t has matched first.regex.p1
aac has matched first.regex.p2
aagga has matched first.regex.p3
cc has matched first.regex.p4
ctgtgta has matched first.regex.p5
Now, when looking in the second file for a substring that matches the second regex, we depend on the results coming from the first file (which match the first regex). In particular, we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).
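The breakdown above can also be produced programmatically. The following is a minimal sketch in Python (chosen only for illustration, since any language is acceptable here); the group names p1 to p5 and the helper describe_match are hypothetical names that mirror the first.regex.p1 to first.regex.p5 labels. It relies on named groups and on the match object's span() method to report which part of the string each section of the regex consumed.

```python
import re

# Hypothetical group names p1..p5 mirror the labels first.regex.p1..p5.
FIRST_REGEX = re.compile(
    r"(?P<p1>[acgt]{1,12000})"
    r"(?P<p2>aac)"
    r"(?P<p3>[ag]{2,5})"
    r"(?P<p4>[acgt]{2,5})"
    r"(?P<p5>ctgtgta)"
)


def describe_match(sequence):
    """Return {group name: (matched text, start, end)}, or None if no match."""
    match = FIRST_REGEX.search(sequence)
    if match is None:
        return None
    return {
        name: (text, *match.span(name))
        for name, text in match.groupdict().items()
    }


parts = describe_match("taacaaggaccctgtgta")
for name, (text, start, end) in parts.items():
    print(f"{name}: {text!r} at [{start}, {end})")
```

Each entry gives the matched text together with its half-open position range in the input, which is the "index" of the match that a later step can refer to.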
Complement means: a is substituted with t, t -> a, g -> c, c -> g. So if you have taaa, the complement will be attt.
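As a small illustration of this substitution, here is a sketch in Python using a translation table. The names COMPLEMENT_TABLE and complement are made up for this example, and lowercase bases are assumed, as in the sequences shown here.

```python
# Translation table for the pairing a<->t and g<->c (lowercase bases assumed).
COMPLEMENT_TABLE = str.maketrans("atgc", "tacg")


def complement(sequence):
    """Return the base-by-base complement; the length is unchanged."""
    return sequence.translate(COMPLEMENT_TABLE)


print(complement("taaa"))   # attt
print(complement("aagga"))  # ttcct
```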
Therefore, going back to this example:
- taacaaggaccctgtgta
We need to complement the following to satisfy the requirements of the second regex:
aagga has matched first.regex.p3
cc has matched first.regex.p4
And the complements are:
ttcct (based on rule1)
gg (based on rule2)
So an example of a substring that matches second.regex is this:
ctaaaacacctttgggttcctcttaaaaaaaaagggggagagagaagaaaaaaagagaggg
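To check this example by hand, one can substitute the two complements into second.regex and test the string against the result. The sketch below does that in Python; the extra parentheses around the variable-length sections are an addition for inspection only, and the variable names are hypothetical.

```python
import re

# second.regex with rule1 replaced by "ttcct" and rule2 replaced by "gg".
# Every section is wrapped in a group so each piece can be inspected.
second_regex = re.compile(
    r"(ctaaa)([ac]{5,100})(tttggg)(ttcct)(ctt)([ag]{10,5000})(gg)"
)

candidate = "ctaaaacacctttgggttcctcttaaaaaaaaagggggagagagaagaaaaaaagagaggg"
match = second_regex.search(candidate)

if match is None:
    print("no match")
else:
    for number, text in enumerate(match.groups(), start=1):
        print(number, text)
```

This only verifies a single hand-built pattern; it does not yet derive the complements from the first match automatically.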
This is only one example! In my case I have 12000 matched substrings! I cannot figure out how to even approach this problem. I have tried writing pure regex, but I have failed to implement anything that follows this logic. Perhaps I shouldn't even be using regex?
Is it possible to do this entirely with regex? Or should I look at another approach? Is it possible to index a regex and, in the second regex, reference the first regex and force the regex to consider the matched substrings as returned by the first regex?
This can be done programmatically in Perl, or in any other language.
Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. It has to be done in the program surrounding your matches, which should themselves be regex, as that's what regex is meant for.
You can build the second pattern step by step. I've implemented a more advanced version in Perl that can be adapted to suit other pattern combinations as well, without changing the actual code that does the work.
Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.
The main idea behind this is to split both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two take a value from the arguments to build the complements.
    use strict;
    use warnings;

    sub complement {
        my $string = shift;
        $string =~ tr/atgc/tacg/;    # transliteration, faster than s///
        return $string;
    }

    # first regex, split into sub-patterns
    my @first = (
        qr([acgt]{1,12000}),
        qr(aac),
        qr([ag]{2,5}),
        qr([acgt]{2,5}),
        qr(ctgtgta),
    );

    # second regex, split into sub-patterns as callbacks
    my @second = (
        sub { return qr(ctaaa) },
        sub { return qr([ac]{5,100}) },
        sub { return qr(tttggg) },
        sub {
            my (@matches) = @_;

            # complement of the part that matched first.regex.p3
            return complement( $matches[3] );
        },
        sub { return qr(ctt) },
        sub { return qr([ag]{10,5000}) },
        sub {
            my (@matches) = @_;

            # complement of the part that matched first.regex.p4
            return complement( $matches[4] );
        },
    );

    my $file2 = "ctaaaacacctttgggttcctcttaaaaaaaaagggggagagagaagaaaaaaagagaggg";

    while ( my $file1 = <DATA> ) {

        # this pattern matches the full thing in $1, and each sub-section in $2, $3, ...
        # @matches will contain (full, $2, $3, $4, $5, $6)
        my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );

        # iterate the list of anonymous functions and call each of them,
        # passing in the match results of the first match
        my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;

        my @matches2 = ( $file2 =~ m/($pattern2)/ );
    }

    __DATA__
    aaacccgtgtaataacagacgtactgtgta
    tttttttgcgaccgagaaacggttctgtgta
    taacaaggaccctgtgta

These are the generated second patterns for the three input substrings.
    ((?^:ctaaa))((?^:[ac]{5,100}))((?^:tttggg))(tct)((?^:ctt))((?^:[ag]{10,5000}))(gcat)
    ((?^:ctaaa))((?^:[ac]{5,100}))((?^:tttggg))(cc)((?^:ctt))((?^:[ag]{10,5000}))(aa)
    ((?^:ctaaa))((?^:[ac]{5,100}))((?^:tttggg))(ttcct)((?^:ctt))((?^:[ag]{10,5000}))(gg)

If you're not familiar with this notation, it's what you get when you print a pattern that was constructed with the quoted regex operator qr//.
The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.

    [
        [0] "ctaaaacacctttgggttcctcttaaaaaaaaagggggagagagaagaaaaaaagagaggg",
        [1] "ctaaa",
        [2] "acacc",
        [3] "tttggg",
        [4] "ttcct",
        [5] "ctt",
        [6] "aaaaaaaaagggggagagagaagaaaaaaagagag",
        [7] "gg"
    ]

I cannot speak to the speed of this implementation, but I believe it will be reasonably fast.
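For readers working outside Perl, the same callback idea can be sketched in Python. This is an illustrative translation under stated assumptions, not the code above: the names FIRST_REGEX, SECOND_PARTS and second_pattern_for are made up here, and re.escape is added as a precaution so that the complemented text is always treated literally.

```python
import re

COMPLEMENT_TABLE = str.maketrans("atgc", "tacg")


def complement(sequence):
    """Base-by-base complement (a<->t, g<->c), same length as the input."""
    return sequence.translate(COMPLEMENT_TABLE)


# First regex with one capture group per sub-pattern (p1..p5).
FIRST_REGEX = re.compile(r"([acgt]{1,12000})(aac)([ag]{2,5})([acgt]{2,5})(ctgtgta)")

# Second regex as a list of callbacks. Each receives the groups of the
# first match: index 0 is the whole match, indexes 1..5 are p1..p5.
SECOND_PARTS = [
    lambda g: r"ctaaa",
    lambda g: r"[ac]{5,100}",
    lambda g: r"tttggg",
    lambda g: re.escape(complement(g[3])),  # rule1: complement of p3
    lambda g: r"ctt",
    lambda g: r"[ag]{10,5000}",
    lambda g: re.escape(complement(g[4])),  # rule2: complement of p4
]


def second_pattern_for(sequence):
    """Build the second pattern for one file1 sequence, or None if no match."""
    match = FIRST_REGEX.search(sequence)
    if match is None:
        return None
    groups = (match.group(0),) + match.groups()
    return "".join(f"({part(groups)})" for part in SECOND_PARTS)


file1 = [
    "aaacccgtgtaataacagacgtactgtgta",
    "tttttttgcgaccgagaaacggttctgtgta",
    "taacaaggaccctgtgta",
]
file2 = "ctaaaacacctttgggttcctcttaaaaaaaaagggggagagagaagaaaaaaagagaggg"

patterns = [second_pattern_for(sequence) for sequence in file1]
for pattern in patterns:
    found = re.search(pattern, file2)
    print(pattern, "->", found.groups() if found else "no match")
```

With the example data, only the pattern built from the third sequence is expected to match file2, since file2 was constructed from that sequence.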
If you wanted to find other combinations of patterns, all you would have to do is replace the sub { ... } entries in those two arrays. If the first match has a number of sub-patterns other than five, you'd also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.

    my @matches = ( $file1 =~ join q{}, map { "($_)" } @first);

If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF.
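The programmatic construction of the first pattern can be sketched in Python as well. The helper name build_pattern is hypothetical, and the sketch assumes that no sub-pattern contains capture groups of its own, since those would shift the group numbering.

```python
import re


def build_pattern(sub_patterns):
    """Wrap each sub-pattern in its own capture group and concatenate them."""
    return re.compile("".join(f"({sub})" for sub in sub_patterns))


# Any number of sub-patterns works; here, the five from the example.
first = [r"[acgt]{1,12000}", r"aac", r"[ag]{2,5}", r"[acgt]{2,5}", r"ctgtgta"]

match = build_pattern(first).search("taacaaggaccctgtgta")
print(match.groups())
```

Because the group count follows the length of the list, adding or removing a sub-pattern needs no change to the matching code.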