Use Docker For Google Cloud Dataflow Dependencies
Solution 1:
2021 update
Dataflow now supports custom docker containers. You can create your own container by following these instructions:
https://cloud.google.com/dataflow/docs/guides/using-custom-containers
The short answer is that Beam publishes its SDK containers on Docker Hub under apache/beam_${language}_sdk:${version}.
In your Dockerfile you would use one of them as base:
FROM apache/beam_python3.8_sdk:2.30.0
# Add your customizations and dependencies
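As a rough sketch of what those customizations might look like, a Dockerfile could add system packages and Python dependencies on top of the Beam base image. The specific packages below (ffmpeg and opencv-python-headless) are placeholders for whatever your pipeline actually needs:

FROM apache/beam_python3.8_sdk:2.30.0

# System-level dependency (placeholder example: ffmpeg for video processing)
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies your pipeline code imports (placeholder example)
RUN pip install --no-cache-dir opencv-python-headless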
Then you would upload this image to a container registry such as GCR or Docker Hub, and specify the following option: --worker_harness_container_image=$IMAGE_URI
And bingo! You have a custom container.
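For example, the build, push, and launch steps might look roughly like this; the image name, project, bucket, and pipeline module are placeholders:

# Build and push the custom image (all names are placeholders)
docker build -t gcr.io/my-project/my-beam-worker:2.30.0 .
docker push gcr.io/my-project/my-beam-worker:2.30.0

# Launch the pipeline on Dataflow with the custom image
python -m my_pipeline \
    --runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --temp_location=gs://my-bucket/tmp \
    --experiments=use_runner_v2 \
    --worker_harness_container_image=gcr.io/my-project/my-beam-worker:2.30.0

Note that custom containers run on Dataflow Runner v2, hence the use_runner_v2 experiment; newer SDK versions enable it by default.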
Previously (before custom container support), it was not possible to modify or switch the default Dataflow worker container; you had to install the dependencies according to the documentation.
Solution 2:
If you have a large number of videos, you will incur the large startup cost regardless; such is the nature of grid computing in general.
The other side of this is that you could use larger machines for the job than the n1-standard-1 machines, amortizing the download cost across fewer machines that could potentially process more videos at once, provided the processing is coded to take advantage of them.
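As an illustrative sketch (the machine type and worker counts are just example values), the worker shape can be set through pipeline options such as:

python -m my_pipeline \
    --runner=DataflowRunner \
    --worker_machine_type=n1-standard-8 \
    --num_workers=5 \
    --max_num_workers=20
    # ...plus the usual project/region/temp_location options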
Solution 3:
One solution is to issue the pip install commands through a setup.py file, using the approach documented for Non-Python Dependencies.
Doing this downloads the manylinux wheel on the workers instead of the source distribution that requirements-file processing would otherwise stage.
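A minimal sketch of such a setup.py, modeled on the custom-commands pattern from the Beam examples (the package name and the opencv-python-headless dependency are placeholders), might look like this:

import subprocess
from distutils.command.build import build as _build

import setuptools

# Commands run on each worker when the staged package is built there.
# Installing via pip here pulls a prebuilt manylinux wheel. (Placeholder dependency.)
CUSTOM_COMMANDS = [
    ["pip", "install", "opencv-python-headless"],
]


class build(_build):
    """Adds the custom commands to the standard build step."""
    sub_commands = _build.sub_commands + [("CustomCommands", None)]


class CustomCommands(setuptools.Command):
    """setuptools command that shells out to run CUSTOM_COMMANDS."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name="my_dataflow_job",  # placeholder package name
    version="0.0.1",
    packages=setuptools.find_packages(),
    cmdclass={"build": build, "CustomCommands": CustomCommands},
)

The pipeline is then pointed at this file with the --setup_file=./setup.py option.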