java - How do I effectively process a large gzipped file in Dataflow?


We have some batch jobs that process gzipped files of ~10 GB compressed, ~30 GB uncompressed.

Processing these in Java takes an unreasonable amount of time, and we are looking for ways to make it more efficient. Whether we use TextIO or the native Java GCS SDK to download the file, the job takes more than 8 hours, and the reason is that it cannot scale out: it won't split the file, since it is gzipped.

If we unzip the file beforehand and process the uncompressed file, the job takes about 10 minutes, on the order of 100 times faster.

I can totally understand that processing a gzipped file takes more time, but 100 times longer is too much.

You're correct that gzipped files are not splittable, so Dataflow has no way to parallelize reading each gzipped input file. Storing them uncompressed in GCS is the best route if that's possible for you.
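The one-time decompression step can be done as a streaming copy, so the ~30 GB payload never has to fit in memory. Below is a minimal sketch using only `java.util.zip` from the standard library; in a real setup you would wire the input and output streams to the GCS client (or simply use `gsutil`), which is not shown here, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class GzipUtil {
    // Streams a gzipped input to an uncompressed output in fixed-size
    // chunks. Note the read is inherently sequential: gzip has no block
    // index, which is exactly why Dataflow cannot split the file.
    public static long decompress(InputStream gzipped, OutputStream out)
            throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(gzipped)) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total; // number of uncompressed bytes written
    }
}
```

Once the uncompressed copy is in GCS, TextIO can split it into ranges and fan the work out across workers.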

Regarding the 100x performance difference: how many worker VMs did the pipeline scale to in the uncompressed vs. compressed versions of the pipeline? If you have a job ID, we can investigate further internally.

