parallel processing - Is there a constant-time algorithm for generating a bandlimited sawtooth? -


i'm looking feasibility of gpu synthesized audio, each thread renders sample. puts interesting restrictions on algorithms can used - algorithm refers previous set of samples cannot implemented in fashion.

filtering 1 of algorithms. bandpass, lowpass, or highpass - of them require looking last few samples generated in order compute result. can't done because samples haven't been generated yet.

this makes synthesizing bandlimited waveforms difficult. 1 approach additive synthesis of partials using fourier series. however, runs @ o(n) time, , slow on gpu point gain of parallelism lost. if there algorithm ran @ o(1) time, eliminate branching , 1000x faster when dealing audible range.

i'm looking dsf sawtooth. i've been trying work out simplification of fourier series hand, that's really, hard. because involves harmonic numbers, aka singularity of riemann-zeta function.

is constant-time algorithm achievable? if not, can proven isn't?

filtering 1 of algorithms. bandpass, lowpass, or highpass - of them require looking last few samples generated in order compute result. can't done because samples haven't been generated yet.

that's not right. iir filters need previous results, fir filters need previous input; pretty typical things gpus designed do, it's not problem let every processing core access let's 64 input samples produce 1 output sample -- in fact, cache architectures nvidia , amd use lend that.

is constant-time algorithm achievable? if not, can proven isn't?

it is! in 2 aspects:

  1. as mentioned above, fir filters need multiple samples of immutable input, can parallelized heavily without problems, ,
  2. even if need calculate input first, , parallelize (i don't see reason -- generating sawtooth not cpu-limited, memory bandwidth limited), every core calculate last n samples -- sure, there's n-1 redundant operations, long number of cores bigger n, still faster, , every core have constant run time.

comments on approach:

i'm looking feasibility of gpu synthesized audio, each thread renders sample.

from higher-up perspective, sounds fine-granular. mean, let's have 3000 stream processors (high-end consumer gpu). assuming have sampling rate of 44.1khz, , assuming each of these processors 1 sample, letting them run once gives 1/14.7 of second of audio (mono). you'd have move on next part of audio.

in other words: there's bound much more samples processors. in these situations, it's typically way more efficient let 1 processor handle sequence of samples; example, if want generate 30s of audio, that'd 1.323ms (amples). splitting problem 3000 chunks, 1 each processor, , giving each of them 44100*30/3000=441 samples should process plus 64 samples of "history" before first of "own" samples still fit local memory.

yet thought:

i'm coming software defined radio background, there's millions of samples per second, rather few khz of sampling rate, in real time (i.e. processing speed > sampling rate). still, doing computation on gpu pays more cpu-intense tasks, because there's significant overhead in exchanging data gpu, , cpus nowadays blazingly fast. so, relatively simple problem, might never work faster things on gpu compared optimizing them on cpu; things of course different if you've got process lots of samples, or lot of streams, @ once. finer-granular tasks, problem of filling buffer, moving gpu, , getting result buffer software kills advantage.

hence, i'd challenge you: download gnu radio live dvd, burn dvd or write usb stick (you might run in vm, of course reduces performance if don't know how optimize virtualizer; - try live medium), run

volk_profile 

to let volk library test algorithms work best on specific machine, , launch

gnuradio-companion 

and then, run open following 2 signal processing flow graphs:

  1. "classical fir": classical fir
    single-threaded implementation of fir filter yields 50msamples/s on cpu.
  2. fir filter implemented fft, running on 4 threads: enter image description here
    implementation reaches 160msamples/s (!!) on cpu alone.

sure, of ffts on gpu, could faster, thing here is: "simple" fir filter, can, single cpu core, 50 megasamples out of machine -- meaning that, 44.1khz audio sampling rate, per single second can process 19 minutes of audio. no copying in , out of host ram. no gpu cooler spinning up. might not worth optimizing further. , if optimize , take fft-filter approach: 160ms/s means one hour of audio per processing second, including sawtooth generation.


Comments

Popular posts from this blog

resizing Telegram inline keyboard -

command line - How can a Python program background itself? -

php - "cURL error 28: Resolving timed out" on Wordpress on Azure App Service on Linux -