java - How do I effectively process a large gzipped file in Dataflow?


We have some batch jobs that process gzipped files of ~10 GB compressed, ~30 GB uncompressed.

Processing these in Java takes an unreasonable amount of time, and we are looking for ways to make it more efficient. Whether we use TextIO or the native Java GCS SDK to download the file, the job takes more than 8 hours, and the reason is that it cannot scale out: it won't split the file, since it is gzipped.

If we unzip the file beforehand and process the uncompressed file, the job takes about 10 minutes, on the order of 100 times faster.

I can totally understand that processing a gzipped file takes more time, but 100 times longer is too much.

You're correct that gzipped files are not splittable, so Dataflow has no way to parallelize reading each gzipped input file. Storing them uncompressed in GCS is the best route if that's possible for you.
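The one-time decompression step can be done as a streaming copy, so the ~30 GB payload never has to fit in memory. Below is a minimal sketch using only `java.util.zip` from the standard library; in a real setup you would wire the input and output streams to the GCS client (or simply use `gsutil`), which is not shown here, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class GzipUtil {
    // Streams a gzipped input to an uncompressed output in fixed-size
    // chunks. Note the read is inherently sequential: gzip has no block
    // index, which is exactly why Dataflow cannot split the file.
    public static long decompress(InputStream gzipped, OutputStream out)
            throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(gzipped)) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total; // number of uncompressed bytes written
    }
}
```

Once the uncompressed copy is in GCS, TextIO can split it into ranges and fan the work out across workers.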

Regarding the 100x performance difference: how many worker VMs did the pipeline scale to in the uncompressed vs. compressed versions of the pipeline? If you have a job ID, we can investigate further internally.

