Filter Through Files In GCS Bucket Folder And Delete 0 Byte Files With Dataflow
I am currently trying to delete all the files that are 0 bytes within a Google Cloud Storage bucket folder. I want to be able to do this with Apache Beam and the Dataflow runner.
Solution 1:
You don't need to actually read the files to detect empty ones; you can use the FileSystem object directly to check file sizes and delete as needed. The FileMetadata objects returned by the match() function include the size of each file.
Something like:

import apache_beam as beam

class DeleteEmpty(beam.DoFn):
    def __init__(self, gfs):
        self.gfs = gfs

    def process(self, file_metadata):
        if file_metadata.size_in_bytes == 0:
            self.gfs.delete([file_metadata.path])

files = (p
         | 'Filenames' >> beam.Create(
               gfs.match(['<directory glob pattern>'])[0].metadata_list)
         | 'Reshuffle' >> beam.Reshuffle()  # allows the downstream steps to be parallelized after the Create
         | 'Delete empty files' >> beam.ParDo(DeleteEmpty(gfs)))

Note that match() takes a list of patterns and returns one MatchResult per pattern, hence the [0] to get the metadata for the single pattern.
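Put together, a minimal end-to-end sketch might look like the following, assuming the DeleteEmpty DoFn above and a GCSFileSystem built from the pipeline options; the bucket path and option values are hypothetical placeholders:

import apache_beam as beam
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical placeholders: in practice also pass --project,
# --region, --temp_location, etc. for a Dataflow job.
options = PipelineOptions(runner='DataflowRunner')
gfs = GCSFileSystem(options)

with beam.Pipeline(options=options) as p:
    (p
     | 'Filenames' >> beam.Create(
           gfs.match(['gs://my-bucket/my-folder/**'])[0].metadata_list)
     | 'Reshuffle' >> beam.Reshuffle()
     | 'Delete empty files' >> beam.ParDo(DeleteEmpty(gfs)))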
GCS doesn't really have folders; they are just a convenience added by the UI and gsutil. When a folder contains no objects, that folder simply doesn't exist, so deleting the empty files leaves nothing extra to clean up. See https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
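As a quick illustration of that flat namespace (the paths here are hypothetical), listing a prefix with the same match() API returns the objects whose names share that prefix, not a directory entry:

# Hypothetical paths: 'folders' are just shared name prefixes.
# Once every object under gs://my-bucket/logs/ is deleted, the
# 'folder' itself no longer exists and nothing matches the pattern.
for md in gfs.match(['gs://my-bucket/logs/*'])[0].metadata_list:
    print(md.path, md.size_in_bytes)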