Get A List Of Subdirectories
I know I can do this:

```
>>> data = sc.textFile('/hadoop_foo/a')
>>> data.count()
240
>>> data = sc.textFile('/hadoop_foo/*')
>>> data.count()
168129
```

However, I would like to count the size of the data of every subdirectory of '/hadoop_foo/'.
Solution 1:
With Python, use the hdfs module; its walk() method can give you the list of files. The code should look something like this:
```python
from hdfs import InsecureClient

# Replace host, port and user with your WebHDFS endpoint details.
client = InsecureClient('http://host:port', user='user')

# walk() behaves like os.walk(); depth=0 means no depth limit.
for path, dirnames, filenames in client.walk('/hadoop_foo', depth=0):
    print(path, dirnames)
```
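Since the question is ultimately about the size of each subdirectory, the same client can also report content summaries; a minimal sketch, assuming the `client` above and `/hadoop_foo` as the directory of interest:

```python
# list() returns the names of a directory's children; content() returns a
# ContentSummary dict whose 'length' field is the total size in bytes.
for name in client.list('/hadoop_foo'):
    summary = client.content('/hadoop_foo/' + name)
    print(name, summary['length'])
```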
With Scala you can get the filesystem (`val fs = FileSystem.get(new Configuration())`) and call its listFiles() method: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)
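Since the question uses PySpark, note that the same Hadoop FileSystem API is reachable from Python through the SparkContext's JVM gateway; a sketch using the private `sc._jvm`/`sc._jsc` handles, which are widely used but may change between Spark versions:

```python
# Borrow Spark's own Hadoop classes instead of shelling out.
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())

# listStatus() returns FileStatus objects; keep only the directories.
for status in fs.listStatus(hadoop.Path('/hadoop_foo/')):
    if status.isDirectory():
        print(status.getPath().toString())
```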
You can also execute a shell command from your script with the subprocess module, but this is not a recommended approach, since you then depend on the text output of a shell utility.
Eventually, what worked for the OP was using subprocess.check_output():
```python
import subprocess

# Returns the raw bytes of the full "hadoop fs -ls" listing.
subdirectories = subprocess.check_output(["hadoop", "fs", "-ls", "/hadoop_foo/"])
```
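Note that check_output() returns the whole listing as a single bytes blob, so to end up with actual directory paths you still have to parse it; a fragile sketch, assuming the usual `hadoop fs -ls` column layout:

```python
# Keep only directory entries (mode string starts with 'd') and take the
# path, which is the last whitespace-separated column of each line.
paths = [
    line.rsplit(None, 1)[-1]
    for line in subdirectories.decode().splitlines()
    if line.startswith("d")
]
```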