linux - Faster way to find large files with Python?


I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 subdirectories and find files larger than, say, 200 GB, on multiple Linux servers, and it has to be Python.

I have tried many things, like calling du -h from the script, but du is just way too slow to go through a directory as large as 1 TB. I've also tried the find command, find ./ -size +200G, but that is also going to take foreeeever.

I have also tried os.walk() and checking .getsize(), but it's the same problem: too slow. All of these methods take hours and hours, and I need help finding a solution, if anyone is able to help me. It's not just that I have to search for large files on one server; I have to ssh through almost 300 servers and output a giant list of all files > 200 GB, and the three methods I have tried will not get that done. Any help is appreciated, thank you!

It's not true that you cannot do better than os.walk().

scandir is said to be 2 to 20 times faster.

From https://pypi.python.org/pypi/scandir:

Python's built-in os.walk() is slower than it needs to be, because, in addition to calling listdir() on each directory, it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we're not talking about micro-optimizations.

Since Python 3.5 (PEP 471), scandir is built in, provided in the os package. Small (untested) example:

for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} biiiig".format(dentry.name))

(Of course you need stat() at some point, but with os.walk you were calling stat implicitly just by using the function. Also, if you only care about files with specific extensions, you can perform stat only when the extension matches, saving even more.)
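A minimal sketch of that extension trick, with a hypothetical function name of my own; os.scandir entries carry the file-type information readdir already returned, so is_file() usually costs no extra system call, and stat() is only issued for names that match:

```python
import os

def big_files_with_ext(path, ext, max_value):
    """Yield paths of regular files directly under path whose name ends
    with ext and whose size exceeds max_value bytes (illustrative names)."""
    for dentry in os.scandir(path):
        # check the cheap condition first: no stat() for non-matching names
        if dentry.name.endswith(ext) and dentry.is_file(follow_symlinks=False):
            if dentry.stat().st_size > max_value:
                yield dentry.path
```

Note that this scans a single directory level; for a whole tree you would still recurse or use os.walk.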

And there's more to it:

So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.

So migrating to Python 3.5+ magically speeds up os.walk without you having to rewrite your code.
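Putting that together for the question's use case, here is a sketch of a whole-tree search for oversized files (the function name and error handling are my own; on Python 3.5+ the os.walk call below already uses scandir internally, so this benefits from the speedup without extra work):

```python
import os

THRESHOLD = 200 * 1024**3  # 200 GiB, the size from the question

def find_big_files(root, threshold=THRESHOLD):
    """Walk the tree under root and yield (path, size) for every
    regular file whose size exceeds threshold bytes."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.stat(full, follow_symlinks=False).st_size
            except OSError:
                continue  # file vanished or permission denied; skip it
            if size > threshold:
                yield full, size
```

Lowering follow_symlinks avoids counting a symlink's target, which also keeps the walk from reporting the same big file twice through different links.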

From my experience, multiplying stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local-disk users will.

The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh, for instance). It's less convenient, but it's worth it.
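Since the question mentions sshing through almost 300 servers, one way to sketch that fan-out from Python is to run find remotely over ssh, so the stat work happens on the machine that owns the disk; the host name and path below are placeholders, not anything from the original post:

```python
import subprocess

def build_find_cmd(host, root, size="+200G"):
    # find's -size with a '+' prefix matches files larger than the value;
    # "host" and "root" are placeholders for your own servers and paths
    return ["ssh", host, "find", root, "-type", "f", "-size", size]

def remote_big_files(host, root, size="+200G"):
    """Return the list of oversized file paths reported by a remote find."""
    result = subprocess.run(build_find_cmd(host, root, size),
                            capture_output=True, text=True, check=False)
    return result.stdout.splitlines()
```

Looping this over a host list (or running the hosts in a thread pool, since the work is remote) would produce the single big list the question asks for, while keeping every stat call local to its own server.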

