Uploading in parallel to Google Cloud Storage
Introduction
Have you ever found yourself struggling to move large amounts of data between your local machine and Google Cloud Storage (GCS)? Tools like gsutil and the GCS Transfer Manager are here to make your life easier. These powerful tools are designed to simplify and accelerate data transfers, saving you time and effort.
In this post, we'll dive into parallel file uploads. We'll explore the key features and benefits of these tools and how to use them effectively to streamline your data transfer workflows. Whether you're a seasoned cloud user or just getting started, this guide should give you something useful.
Let’s get started!
gsutil: A command line tool for parallel uploads
gsutil is a Python application that lets you access Cloud Storage from the command line. You can use gsutil to do a wide range of bucket and object management tasks, including:
- Creating and deleting buckets.
- Uploading, downloading, and deleting objects.
- Listing buckets and objects.
- Moving, copying, and renaming objects.
- Editing object and bucket ACLs.
In this blog, we will explore the functionality of uploading multiple objects in parallel.
How does gsutil do parallel uploads?
Parallel composite upload strategy
Imagine trying to move a massive bookshelf across town. It's heavy, awkward, and could take hours.
Now, imagine breaking that bookshelf down into smaller, more manageable sections. You could then transport those sections simultaneously, significantly speeding up the move.
That's essentially how parallel composite uploads work. When you upload a large file to Google Cloud Storage, it's broken down into up to 32 smaller chunks. These chunks are then uploaded in parallel, like moving those bookshelf sections. This can dramatically reduce upload time, especially for files over 100 MB.
Here's the catch: the speedup this strategy provides is bounded by your network I/O capacity, which in practice is the bandwidth of your internet connection. If your connection can sustain high speeds, you'll see a significant reduction in upload time when using parallel uploads. If your connection is slower, a single upload may already saturate it, and the gain will be limited.
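To make the chunking idea concrete, here is a minimal sketch of how a file could be divided into byte ranges while staying within the 32-chunk limit mentioned above. The function name plan_chunks and its parameters are made up for illustration; this is not gsutil's actual implementation, just one way to express the idea.

```python
from typing import List, Tuple

MAX_COMPONENTS = 32  # upper bound on chunks per file, as stated above


def plan_chunks(file_size: int, component_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges, end exclusive, that cover the file.

    The chunk size is raised when necessary so that the number of chunks
    never exceeds MAX_COMPONENTS.
    """
    if file_size <= 0 or component_size <= 0:
        raise ValueError("file_size and component_size must be positive")
    # Ceiling division: the smallest chunk size that fits in MAX_COMPONENTS chunks.
    min_size = -(-file_size // MAX_COMPONENTS)
    size = max(component_size, min_size)
    return [(start, min(start + size, file_size))
            for start in range(0, file_size, size)]


MB = 1024 * 1024
for start, end in plan_chunks(500 * MB, 50 * MB):
    print(f"chunk bytes {start}..{end - 1}")
```

Each range could then be handed to a separate worker for upload. The last chunk is simply whatever remains, so it may be smaller than the others.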
Inspecting a parallel composite upload in gsutil
During a parallel composite upload, gsutil splits a large file into smaller, roughly equal-sized chunks, uploads those chunks to your GCS bucket in parallel, and then recombines them into a single object once they are all uploaded. This is known as the parallel composite upload strategy. Separately, the -m flag tells gsutil to process multiple files in parallel. The command below recursively copies the files in the current directory to the specified GCS bucket, transferring several files at a time:
gsutil -m cp -r . gs://my-bucket/files/
Some metrics to compare blob vs parallel composite upload strategy
Here’s a graph showing 100 runs of uploading a 500 MB file with the regular upload strategy and with the composite upload strategy.
Parallel composite uploads clearly perform better than direct blob uploads for larger files. The peak upload time was close to 8 seconds for the blob upload, whereas it was only about 4 seconds for the parallel composite upload, roughly half the latency of the simple upload.
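If you want to run a comparison like this yourself, a small timing harness is enough. The sketch below is an assumption about how such a measurement could be taken, not the code behind the numbers above; time_runs and upload_once are hypothetical names, and upload_once stands for whichever upload you want to measure.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(upload_once: Callable[[], None], runs: int = 100) -> Dict[str, float]:
    """Call upload_once `runs` times and summarize wall-clock durations in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        upload_once()
        durations.append(time.perf_counter() - start)
    return {
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "peak": max(durations),
    }


# Placeholder workload; replace the lambda with a real upload call.
print(time_runs(lambda: time.sleep(0.001), runs=5))
```

Running the same harness once per strategy, against the same file and bucket, gives you comparable median and peak figures.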
Recombining the divided chunks
The smaller chunks are recombined using the compose request. The compose request takes between 1 and 32 objects, and creates a new composite object. The composite object is a concatenation of the source objects in the order they were specified in the request.
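Here is a minimal sketch of the compose step using the Blob.compose method from the Python client library. The helper name compose_chunks and the object names are made up for illustration, and the sketch assumes the chunks have already been uploaded to the same bucket. It is not what gsutil runs internally.

```python
from typing import Sequence

MAX_COMPOSE_SOURCES = 32  # limit for a single compose request, as stated above


def compose_chunks(bucket, chunk_names: Sequence[str], destination_name: str):
    """Concatenate already-uploaded chunk objects into one composite object.

    `bucket` is expected to be a google.cloud.storage Bucket. The chunks are
    concatenated in the order given in chunk_names.
    """
    if not 1 <= len(chunk_names) <= MAX_COMPOSE_SOURCES:
        raise ValueError(
            f"a compose request takes between 1 and {MAX_COMPOSE_SOURCES} objects"
        )
    sources = [bucket.blob(name) for name in chunk_names]
    destination = bucket.blob(destination_name)
    destination.compose(sources)  # issues the compose request
    return destination
```

With the google-cloud-storage package installed, you could call it as compose_chunks(storage.Client().bucket("your-bucket"), ["part-0", "part-1"], "somebigfile"), where the bucket and object names are placeholders.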
Required roles and permissions
Roles
- Storage Object User (roles/storage.objectUser) IAM role
Permissions
- storage.objects.create
- storage.objects.delete
- storage.objects.get
- storage.objects.list
Refer to the Composing objects documentation (listed in the references at the end of this post) to learn more about creating composite objects from smaller chunks.
Optimizing uploads using custom chunk size
gsutil only uses the parallel composite strategy for files larger than a configurable threshold. To set that threshold explicitly, pass the parallel_composite_upload_threshold option to gsutil (or set it in the [GSUtil] section of your .boto file):
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./somebigfile gs://your-bucket
Here, somebigfile is a file that is larger than 150 MB, so gsutil splits it into chunks and uploads them in parallel, increasing upload performance. The size of each chunk is governed by a separate option, parallel_composite_upload_component_size. (Note that there are restrictions on the number of chunks that can be used. Refer to the documentation for more information.)
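If you drive gsutil from a script, it can help to build the command in one place. The sketch below assembles the argument list for the command above; build_gsutil_cp is a hypothetical helper, and the 50M component size is only an example value, not a recommendation.

```python
import shlex
from typing import List, Optional


def build_gsutil_cp(source: str, destination: str, threshold: str = "150M",
                    component_size: Optional[str] = None) -> List[str]:
    """Build the argv for a gsutil cp that enables parallel composite uploads."""
    argv = ["gsutil", "-o",
            f"GSUtil:parallel_composite_upload_threshold={threshold}"]
    if component_size is not None:
        # Optional: size of each uploaded chunk.
        argv += ["-o",
                 f"GSUtil:parallel_composite_upload_component_size={component_size}"]
    argv += ["cp", source, destination]
    return argv


command = build_gsutil_cp("./somebigfile", "gs://your-bucket", component_size="50M")
print(shlex.join(command))
# To actually run it (requires gsutil on your PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.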
Transfer manager: A simpler way of managing GCS uploads
Introduction
The GCS transfer manager module is a powerful tool for managing data transfers to and from Google Cloud Storage (GCS). It provides a convenient way to upload and download files, as well as perform other transfer operations, with minimal coding effort.
Key features and benefits of the GCS transfer manager module:
- Simplified data transfers: The module handles many of the complexities of data transfers, such as authentication, authorization, and error handling, making it easy for developers to focus on their core application logic.
- Parallel transfers: The module can transfer multiple files simultaneously, significantly improving performance for large datasets.
- Resumable transfers: If a transfer is interrupted, the module can resume it from the point of failure, avoiding unnecessary data transfers.
- Per-file results: The upload and download helpers return one result per file, in input order, so developers can see which transfers succeeded and which failed and identify potential issues.
- Error handling: The module includes built-in error handling mechanisms to help developers handle common errors and avoid data loss.
Additional Features
- Customizable configuration: The GCS transfer manager module can be configured to meet your specific needs, such as specifying the number of parallel transfers or the maximum number of retries for failed transfers.
- Integration with other Google Cloud services: The module can be integrated with other Google Cloud services, such as Cloud Functions and Cloud Scheduler, to automate data transfer workflows.
The GCS transfer manager module is a powerful and flexible tool that can simplify data transfers to and from GCS. By understanding its key features and benefits, you can effectively use it to improve the efficiency and reliability of your data transfer workflows.
How to use the GCS transfer manager module:
Install the google-cloud-storage package:
pip install google-cloud-storage
Then use it in your codebase as follows to upload multiple files from a given directory:
import logging
from pathlib import Path

from google.cloud import storage
from google.cloud.storage import transfer_manager


def upload_files_to_gcs(source_directory: str):
    """
    Uses the transfer manager to upload files in parallel to a GCS bucket.

    Args:
        source_directory (str): The local path of the source directory to upload files from.

    Returns:
        list: One result per file, in order (None on success, or the exception
        raised for that file). Returns None if nothing was uploaded.
    """
    if not source_directory:
        logging.warning("No local directory provided")
        return None

    try:
        PROJECT_ID = "your_project_id_here"
        client = storage.Client(project=PROJECT_ID)
        bucket = client.bucket("your_bucket_name_here")

        directory_as_path_obj = Path(source_directory)
        # glob("*") matches top-level entries only; subdirectories are not traversed
        paths = directory_as_path_obj.glob("*")

        # Get the list of file paths, relative to the source directory, as strings
        file_paths = [path for path in paths if path.is_file()]
        relative_paths = [path.relative_to(directory_as_path_obj) for path in file_paths]
        string_paths = [str(path) for path in relative_paths]

        prefix = "test_prefix/"
        # By default, per-file failures are returned in the result list, not raised
        return transfer_manager.upload_many_from_filenames(
            bucket=bucket,
            filenames=string_paths,
            source_directory=source_directory,
            blob_name_prefix=prefix,
        )
    except Exception as e:
        logging.error(f"Failed to upload files to GCS due to exception: {e}")
        return None
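One detail worth knowing: by default, upload_many_from_filenames reports per-file failures by placing the exception in the returned list instead of raising it, so a try/except around the call will not see them. The sketch below shows one way to separate successes from failures; split_results is a hypothetical helper, and the file names in the example are placeholders.

```python
from typing import Dict, List, Sequence, Tuple


def split_results(filenames: Sequence[str],
                  results: Sequence[object]) -> Tuple[List[str], Dict[str, Exception]]:
    """Pair each filename with its result and separate successes from failures."""
    succeeded: List[str] = []
    failed: Dict[str, Exception] = {}
    for name, result in zip(filenames, results):
        if isinstance(result, Exception):
            failed[name] = result  # the upload of this file raised
        else:
            succeeded.append(name)
    return succeeded, failed


# Example with stand-in results: the second upload failed.
ok, errors = split_results(["a.txt", "b.txt"], [None, OSError("connection reset")])
print("uploaded:", ok)
for name, exc in errors.items():
    print(f"failed: {name}: {exc}")
```

In real use, pass the same list of filenames you gave to the transfer manager together with the list it returned, since results come back in input order.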
Selecting the correct tool based on your use-case
Both gsutil and the GCS Transfer Manager are powerful tools for managing data transfers to and from Google Cloud Storage (GCS). However, they have different strengths and weaknesses, and the best choice for you will depend on your specific use case.
When to use gsutil?
While the GCS Transfer Manager offers a convenient programmatic API, there are situations where gsutil, a command-line tool, might be more suitable:
- Advanced Features and Customization: If you require granular control over your GCS operations, such as setting specific ACLs, managing object lifecycle policies, or performing complex data manipulation, gsutil provides a wider range of commands and options.
- Scripting and Automation: For tasks that involve repetitive or complex GCS interactions, gsutil's command-line interface can be easily integrated into scripts or automation workflows. This is particularly useful for tasks like nightly backups, data synchronization, or building CI/CD pipelines.
- Integration with Other Tools: gsutil can be seamlessly integrated with other command-line tools and scripting languages, making it a versatile choice for data pipelines and workflows.
- Combining Commands: If you need to select or process objects before or after a transfer, for example by filtering on name patterns or sorting listings, gsutil's commands can be combined with wildcards and standard shell tools to achieve the desired results.
When to use GCS Transfer manager?
- You want to initiate transfers from JavaScript or Python code: The GCS Transfer Manager module is a great choice when integrating parallel uploads into your server code.
- You need to transfer large files or many files at once: The Transfer Manager's parallel transfer capabilities can significantly speed up transfers for large datasets.
- You want to easily resume interrupted transfers: The Transfer Manager allows you to pause and resume transfers, making it easier to recover from network disruptions or other issues.
- You want to track the outcome of your transfers in code: The Transfer Manager returns a result for each file, allowing you to check which transfers succeeded and handle the ones that failed.
Conclusion
We compared gsutil and transfer manager for parallel file uploads, understanding their strengths and use cases. We also delved into the parallel composite upload strategy, a popular method for uploading large files to the cloud.
References:
- gsutil documentation: https://cloud.google.com/storage/docs/gsutil
- Composing objects: https://cloud.google.com/storage/docs/composing-objects
- Parallel composite uploads: https://cloud.google.com/storage/docs/parallel-composite-uploads
- Transfer manager documentation: https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.transfer_manager
- Code (Python): https://github.com/googleapis/python-storage/blob/9998a5e1c9e9e8920c4d40e13e39095585de657a/samples/snippets/storage_transfer_manager.py