
Filter Through Files In GCS Bucket Folder And Delete 0 Byte Files With Dataflow

I am currently trying to delete all files that are 0 bytes within a Google Cloud Storage bucket folder. I want to do this with Apache Beam and the Dataflow runner.

Solution 1:

You don't need to actually read the files to detect empty ones; you can use the FileSystem object directly to check file sizes and delete as needed. The FileMetadata objects returned by the match() function include each file's size.
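For instance, here is a quick way to inspect what match() returns (a minimal sketch; the path gs://my-bucket/my-dir/** is a hypothetical example, and it assumes Beam's GCP extras are installed):

from apache_beam.io.filesystems import FileSystems

# match() takes a list of patterns and returns one MatchResult per pattern;
# each MatchResult holds a metadata_list of FileMetadata objects.
result = FileSystems.match(['gs://my-bucket/my-dir/**'])[0]
for metadata in result.metadata_list:
  print(metadata.path, metadata.size_in_bytes)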

Something like

import apache_beam as beam

class DeleteEmpty(beam.DoFn):
  def __init__(self, gfs):
    # gfs is a Beam FileSystem instance, e.g. GCSFileSystem
    self.gfs = gfs

  def process(self, file_metadata):
    # FileMetadata already carries the size, so no read is needed
    if file_metadata.size_in_bytes == 0:
      self.gfs.delete([file_metadata.path])

files = (p
         | 'Filenames' >> beam.Create(
             gfs.match(['<directory glob pattern>'])[0].metadata_list)
         | 'Reshuffle' >> beam.Reshuffle()  # allows the downstream steps to be parallelized after the Create
         | 'Delete empty files' >> beam.ParDo(DeleteEmpty(gfs)))
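Putting it together, a minimal end-to-end sketch might look like the following; gs://my-bucket/my-dir/** is a hypothetical placeholder, and the pipeline options would come from your own project settings:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem

# Pass --runner=DataflowRunner, --project, --region, --temp_location, etc.
# on the command line, or set them here explicitly.
options = PipelineOptions()
gfs = GCSFileSystem(options)

with beam.Pipeline(options=options) as p:
  (p
   | 'Filenames' >> beam.Create(
       gfs.match(['gs://my-bucket/my-dir/**'])[0].metadata_list)
   | 'Reshuffle' >> beam.Reshuffle()
   | 'Delete empty files' >> beam.ParDo(DeleteEmpty(gfs)))

Note that the match() call runs at pipeline-construction time, so the file listing is fixed when the job is submitted.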

GCS doesn't really have folders; they are just a convenience added when using the UI or gsutil. When there are no objects in a folder, that folder just doesn't exist. See https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork

