Amazon S3 multi-part upload is a feature provided by Amazon's Simple Storage Service (S3) that allows large files to be uploaded in smaller, more manageable parts. This method is particularly useful for large files or when the network conditions are unstable. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.
Benefits of using multi-part upload
Improved throughput – You can upload parts in parallel to improve throughput.
Quick recovery from any network issues – Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads – You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.
Begin an upload before you know the final object size – You can upload an object as you are creating it.
How Multi-Part Upload Works
Multipart upload is a three-step process: You initiate the upload, you upload the object parts, and after you have uploaded all the parts, you complete the multipart upload. Upon receiving the complete multipart upload request, Amazon S3 constructs the object from the uploaded parts, and you can then access the object just as you would any other object in your bucket. Here's a breakdown of the process:
Initialization: The upload is initiated, and Amazon S3 returns an upload ID. You must include this upload ID whenever you upload parts, list the parts, complete an upload, or stop an upload. If you want to provide any metadata describing the object being uploaded, you must provide it in the request to initiate multipart upload.
Uploading Parts: When uploading a part, in addition to the upload ID, you must specify a part number. You can choose any part number between 1 and 10,000. A part number uniquely identifies a part and its position in the object you are uploading. The part number that you choose doesn’t need to be in a consecutive sequence (for example, it can be 1, 5, and 14). If you upload a new part using the same part number as a previously uploaded part, the previously uploaded part is overwritten.
❗After you initiate a multipart upload and upload one or more parts, you must either complete or stop the multipart upload to stop being charged for storage of the uploaded parts. Amazon S3 frees the parts storage and stops charging you only after you complete or stop the multipart upload.
Part Size: Each part, except the last one, must be at least 5 MB in size. The maximum size of a part is 5 GB.
Tracking: Whenever you upload a part, Amazon S3 returns an entity tag (ETag) header in its response. For each part upload, you must record the part number and the ETag value. Each uploaded part gets its own part number and its own ETag value. You must include these values in the subsequent request to complete the multipart upload.
Completion: When you complete a multipart upload, Amazon S3 creates an object by concatenating the parts in ascending order based on the part number. If any object metadata was provided in the initiate multipart upload request, Amazon S3 associates that metadata with the object. After a successful complete request, the parts no longer exist.
Your complete multipart upload request must include the upload ID and a list of both part numbers and corresponding ETag values. The Amazon S3 response includes an ETag that uniquely identifies the combined object data. This ETag is not necessarily an MD5 hash of the object data.
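As noted above, the ETag of the completed object is not necessarily an MD5 hash of the data. In practice (observed behavior for unencrypted and SSE-S3 objects, not a documented guarantee, and different for SSE-KMS or SSE-C), S3 reports the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. A small sketch that reproduces this value locally so you can compare it against the ETag S3 returns:

```python
import hashlib

def multipart_etag(part_paths):
    """Compute the ETag S3 commonly reports for a multipart object:
    the MD5 of the concatenated per-part MD5 digests, followed by
    '-' and the part count. (Observed behavior for unencrypted and
    SSE-S3 objects; not guaranteed for SSE-KMS or SSE-C.)"""
    digests = b""
    for path in part_paths:
        with open(path, "rb") as f:
            digests += hashlib.md5(f.read()).digest()
    return f'"{hashlib.md5(digests).hexdigest()}-{len(part_paths)}"'
```

If the locally computed value matches the ETag in the complete-multipart-upload response, the parts arrived intact.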
Requirements for Uploaded Files
File Size: Ideal for files larger than 100 MB. The maximum object size you can upload with multipart upload is 5 TB (a single PUT operation, by contrast, is limited to 5 GB).
Part Size and Number: Each part must be at least 5 MB and can be up to 5 GB. The total number of parts can be up to 10,000.
Stable Network Connection: While multi-part upload is more resilient to network issues, a reasonably stable connection is still necessary for uploading large files.
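The part-size and part-count constraints above can be checked before you start uploading. A minimal sketch (the plan_parts helper and the 8 MiB default are illustrative, not part of any AWS API) that validates a chosen part size and computes how many parts a file will need:

```python
MiB = 1024 * 1024
GiB = 1024 * MiB

MIN_PART_SIZE = 5 * MiB   # every part except the last must be at least 5 MB
MAX_PART_SIZE = 5 * GiB   # no part may exceed 5 GB
MAX_PARTS = 10_000        # S3 allows at most 10,000 parts per upload

def plan_parts(object_size, part_size=8 * MiB):
    """Return (part_count, last_part_size) for a multipart upload plan."""
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part size out of range")
    part_count = -(-object_size // part_size)  # ceiling division
    if part_count > MAX_PARTS:
        raise ValueError("too many parts; increase the part size")
    last_part_size = object_size - (part_count - 1) * part_size
    return part_count, last_part_size
```

For example, a 100 MB (100,000,000-byte) file with 8 MiB parts needs 12 parts, the last one smaller than the rest.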
Use Cases
Large Data Sets: Ideal for uploading large datasets like video files, scientific data, or backup archives.
Unstable Network Conditions: Useful when uploading from locations with unreliable or slow internet connections, as it allows for resuming uploads.
Streamlined Workflows: Facilitates processes where different parts of a file are generated or processed by different systems and then assembled in S3.
Efficiency and Speed: Allows for parallel uploads of different parts, significantly speeding up the upload process for large files.
Amazon S3 multi-part upload is a flexible and efficient way to handle large file uploads, especially in scenarios where file size, network reliability, and upload efficiency are critical factors.
Tutorial: Upload a 100 MB file with S3 multi-part upload using the AWS CLI
Prerequisites
Before proceeding with the tutorial, you need a 100 MB file. You can use the following Python code to generate a text file filled with 100 MB of placeholder data:
# Generating a 100 MB text file
file_path = "/my_path/100MB_file.txt"  # Specify the local path to save the file

# Each character written here is 1 byte, so writing 100 million characters
# produces a file of roughly 100 MB
with open(file_path, "w") as file:
    file.write("0" * 100_000_000)  # Write 100 million zeros

print(file_path)
Now, let's go through the step-by-step guide on how to upload this file to Amazon S3 using the AWS CLI and the multi-part upload method.
Steps
1. Install and Configure AWS CLI
If you haven't already, install the AWS Command Line Interface (CLI) and configure it with your AWS credentials. You can find instructions on how to do this in the AWS CLI User Guide.
2. Create a Bucket (if necessary)
If you don't have an S3 bucket to upload the file to, create one using the following command:
aws s3 mb s3://your-bucket-name
Replace your-bucket-name with your desired bucket name.
3. Initiate Multi-Part Upload
Begin the multi-part upload process and get an upload ID:
aws s3api create-multipart-upload --bucket your-bucket-name --key your-file-name.txt --output json
Note the UploadId in the response. You'll need it for the next steps.
4. Upload Parts
Split your 100 MB file into smaller parts. Each part must be at least 5 MB. You can use a file splitting Bash script for this. The details of the script can be found here: https://github.com/Brain2life/bash-cookbook/tree/split-file
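If you prefer Python over Bash, splitting can be done with a short sketch like the one below (the split_file helper, the 5 MB default, and the file-part-N naming scheme are illustrative, chosen to match the part filenames used in the next command):

```python
import os

def split_file(path, part_size=5 * 1024 * 1024, out_dir="."):
    """Split `path` into sequentially numbered files of at most
    `part_size` bytes, named file-part-1, file-part-2, and so on."""
    part_paths = []
    part_number = 1
    with open(path, "rb") as src:
        while True:
            chunk = src.read(part_size)
            if not chunk:
                break
            part_path = os.path.join(out_dir, f"file-part-{part_number}")
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            part_paths.append(part_path)
            part_number += 1
    return part_paths
```

Every part except possibly the last will be exactly part_size bytes, which satisfies the 5 MB minimum.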
Upload each part separately using the following command:
aws s3api upload-part --bucket your-bucket-name --key your-file-name.txt --part-number N --body file-part-N --upload-id your-upload-id
Replace N with the part number (starting from 1) and file-part-N with the filename of the part. Repeat this for each part.
5. Complete the Upload
After all parts are uploaded, complete the multi-part upload by providing a list of the uploaded parts:
aws s3api complete-multipart-upload --multipart-upload file://file-parts.json --bucket your-bucket-name --key your-file-name.txt --upload-id your-upload-id
The file-parts.json file should contain the part numbers and their respective ETags (returned in the response to each upload-part call). Here is an example of what file-parts.json should look like:
{
"Parts": [
{
"PartNumber": 1,
"ETag": "\"etag-part1\""
},
{
"PartNumber": 2,
"ETag": "\"etag-part2\""
},
{
"PartNumber": 3,
"ETag": "\"etag-part3\""
}
// ... continue for each part
]
}
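Writing this JSON by hand is error-prone for uploads with many parts. A small sketch that builds the manifest from ETags recorded in part-number order (build_parts_manifest is a hypothetical helper, not an AWS API):

```python
import json

def build_parts_manifest(etags, out_path="file-parts.json"):
    """Write a complete-multipart-upload manifest.

    etags: list of ETag strings in part-number order, exactly as
    returned by each upload-part call (including the quotes).
    """
    manifest = {
        "Parts": [
            {"PartNumber": i, "ETag": etag}
            for i, etag in enumerate(etags, start=1)
        ]
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Pass the resulting file to the CLI as file://file-parts.json.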
6. Verify Upload
Finally, you can verify the upload:
aws s3 ls s3://your-bucket-name/your-file-name.txt
This process allows you to efficiently upload large files to Amazon S3 in parts, which is especially useful for handling network issues or for speeding up the upload process by parallelizing the parts. Remember to replace placeholders like your-bucket-name, your-file-name.txt, and your-upload-id with actual values relevant to your use case.