linux - Faster way to find large files with Python?
I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories, and to find files larger than, say, 200 GB or so on multiple Linux servers. It has to be Python.
I have tried many things, such as calling du -h from the script, but du is far too slow to go through a directory as large as 1 TB. I've also tried the find command, like find ./ +200g, but that is also going to take forever.
I have also tried os.walk() and calling .getsize(), but it's the same problem: too slow. All of these methods take hours and hours, and I need help finding another solution if anyone is able to help me. Not only do I have to search for large files on one server, I also have to ssh through almost 300 servers and output a giant list of all files > 200 GB, and the 3 methods I have tried will not get that done. Any help is appreciated, thank you!
It's not true that you cannot do better than os.walk(): scandir is said to be 2 to 20 times faster.
From https://pypi.python.org/pypi/scandir:
Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.
In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.
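To make the quoted difference concrete, here is a minimal sketch that lists the subdirectories of a directory in both styles. The function names are made up for illustration, and the with-statement form of os.scandir() assumes Python 3.6 or later. Whether is_dir() really avoids a stat() call depends on the filesystem reporting the entry type, which is usually the case but is not guaranteed.

```python
import os


def subdirs_with_listdir(path):
    # One listdir() call, then one stat() per entry inside os.path.isdir().
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def subdirs_with_scandir(path):
    # DirEntry.is_dir() can usually answer from the directory listing itself,
    # so on most filesystems no extra stat() per entry is needed.
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())
```

Both functions return the same names; the second one simply asks the operating system fewer questions to get there.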
From Python 3.5, thanks to PEP 471, scandir is now built-in, provided in the os package. A small (untested) example:
    import os

    max_value = 200 * 1024**3  # size threshold in bytes (200 GiB, as in the question)

    for dentry in os.scandir("/path/to/dir"):
        if dentry.stat().st_size > max_value:
            print("{} is biiiig".format(dentry.name))
(Of course you need stat at some point, but with os.walk you called stat implicitly when using the function. Also, if the files have specific extensions, you can perform stat only when the extension matches, saving even more.)
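The example above only looks at a single directory, so here is a rough recursive sketch that also applies the extension trick. The function name find_large_files and its parameters are hypothetical, it assumes Python 3.6+ (for the with-statement on os.scandir()), and it skips symlinks and unreadable directories rather than reporting them.

```python
import os


def find_large_files(root, min_size, suffixes=None):
    """Yield (path, size) for regular files under root larger than min_size bytes.

    suffixes, if given, must be a tuple of file name endings such as (".log",);
    files that do not match are skipped without calling stat() on them.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            entries = os.scandir(current)
        except OSError:
            continue  # unreadable or vanished directory: skip it
        with entries:
            for entry in entries:
                try:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        if suffixes and not entry.name.endswith(suffixes):
                            continue  # no stat() for uninteresting files
                        size = entry.stat(follow_symlinks=False).st_size
                        if size > min_size:
                            yield entry.path, size
                except OSError:
                    continue  # entry disappeared or cannot be read
```

For the case in the question, the call would look something like find_large_files("/path/to/dir", 200 * 1024**3), iterated in a for loop to print each hit.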
And there's more to it:
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.
From my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local disk users.
The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh for instance). It's less convenient, but it's worth it.
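One possible way to run the scan on the remote machine is to pipe a small scanner script to the remote interpreter over ssh, so nothing has to be installed there. This is only a sketch under several assumptions: key-based (non-interactive) ssh access, a python3 (3.5+) executable on each server, and paths that contain no newline characters. The names REMOTE_SCANNER, build_ssh_command and scan_host are invented for this example.

```python
import shlex
import subprocess

# Scanner source, sent to the remote interpreter on stdin ("python3 -").
REMOTE_SCANNER = r'''
import os, sys
root, min_size = sys.argv[1], int(sys.argv[2])
stack = [root]
while stack:
    try:
        entries = list(os.scandir(stack.pop()))
    except OSError:
        continue
    for e in entries:
        try:
            if e.is_dir(follow_symlinks=False):
                stack.append(e.path)
            elif e.is_file(follow_symlinks=False):
                size = e.stat(follow_symlinks=False).st_size
                if size > min_size:
                    print(size, e.path, sep="\t")
        except OSError:
            pass
'''


def build_ssh_command(host, root, min_size):
    # ssh joins its arguments into one remote shell command, so quote the path.
    remote = "python3 - {} {}".format(shlex.quote(root), int(min_size))
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


def scan_host(host, root, min_size, timeout=None):
    """Run the scanner on host and return a list of (size, path) tuples."""
    result = subprocess.run(
        build_ssh_command(host, root, min_size),
        input=REMOTE_SCANNER,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if ssh or the scanner fails
    )
    hits = []
    for line in result.stdout.splitlines():
        size, path = line.split("\t", 1)
        hits.append((int(size), path))
    return hits
```

With around 300 servers, calling scan_host from a concurrent.futures.ThreadPoolExecutor would let the hosts be scanned in parallel, since each call mostly waits on the remote machine.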