Table of contents
- Source and Destination Buckets:
- Use Cases of Amazon S3 Inventory:
- Data Collected by Amazon S3 Inventory:
- S3 ListObjects API call vs Amazon S3 Inventory
- Tutorial: Configuring Amazon S3 Inventory by using the S3 console
- 1. Set Up an AWS Account
- 2. Create an S3 Bucket
- 3. Upload Some Files to Your Bucket
- 4. Create a Second Bucket for Storing Inventory Reports
- 5. Enable S3 Inventory on the First Bucket
- 6. Wait for the Inventory Report to be Generated
- 7. Accessing the Inventory Report
- 8. Analyze the Inventory Data
- Additional Tips:
- References:
Amazon S3 Inventory is a feature provided by Amazon Web Services (AWS) that helps manage storage within the Amazon Simple Storage Service (S3). It provides a scheduled report of objects in an S3 bucket or a shared prefix, offering insights into the storage usage, and aiding in efficient management and analysis of data stored in Amazon S3.
Amazon S3 Inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or objects with a shared prefix (that is, objects that have names that begin with a common string). If you set up a weekly inventory, a report is generated every Sunday (UTC time zone) after the initial report. For information about Amazon S3 Inventory pricing, see Amazon S3 pricing.
You can configure multiple inventory lists for a bucket. When you're configuring an inventory list, you can specify the following:
What object metadata to include in the inventory
Whether to list all object versions or only current versions
Where to store the inventory list file output
Whether to generate the inventory on a daily or weekly basis
Whether to encrypt the inventory list file
You can query Amazon S3 Inventory with standard SQL queries by using Amazon Athena, Amazon Redshift Spectrum, and other tools, such as Presto, Apache Hive, and Apache Spark. For more information about using Athena to query your inventory files, see Querying Amazon S3 Inventory with Amazon Athena.
Source and Destination Buckets:
In the context of Amazon S3 Inventory, the terms "source bucket" and "destination bucket" refer to two distinct roles within the S3 inventory process.
Each of these buckets serves a specific purpose and contains different types of files:
Source Bucket
Role: The source bucket is the Amazon S3 bucket about which you want to collect inventory information. It's the target of the inventory process.
Contents: This bucket contains your actual data—these are the objects (files) that you've stored in Amazon S3. The contents can vary widely depending on your use case; it might include documents, images, videos, backups, application data, etc.
Purpose in S3 Inventory: The S3 Inventory service scans the objects in this bucket to generate the inventory report. It looks at each object and collects specified metadata, such as the object's key (name), size, last modified date, storage class, and other information based on your inventory configuration settings.
Destination Bucket
Role: The destination bucket is where the inventory reports generated by the S3 Inventory service are stored. It's separate from the source bucket.
Contents: This bucket contains the inventory report files. These are not the actual data files but rather metadata reports about the objects in the source bucket. The reports are written in the format chosen in the inventory configuration: CSV, ORC, or Parquet.
Purpose in S3 Inventory: The destination bucket acts as a repository for the inventory reports. Once generated, these reports are automatically stored in this bucket. You can then access and analyze these reports to gain insights into the data stored in the source bucket.
The source bucket is your original data storage in S3, and the S3 Inventory service analyzes this bucket to create a report. The destination bucket is where these generated reports are stored, allowing you to access and review the metadata about your stored objects without directly querying the source bucket. This separation ensures that the inventory process does not interfere with the primary storage and operation of your data.
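For S3 to deliver reports, the destination bucket needs a bucket policy that allows the S3 service to write objects into it (the console typically adds this statement for you when you create the configuration). The following is a minimal sketch of such a policy built as a Python dictionary, modeled on the pattern in the AWS documentation. The helper name `build_destination_policy`, the bucket names, and the account ID are placeholders chosen for illustration, not part of any AWS API.

```python
import json


def build_destination_policy(source_bucket: str, destination_bucket: str, account_id: str) -> dict:
    """Return a bucket policy that lets S3 Inventory write reports to the destination bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowS3InventoryDelivery",  # illustrative statement ID
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": [f"arn:aws:s3:::{destination_bucket}/*"],
                "Condition": {
                    # Only accept deliveries that originate from the source bucket...
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{source_bucket}"},
                    "StringEquals": {
                        # ...owned by this account, with full control granted to the bucket owner.
                        "aws:SourceAccount": account_id,
                        "s3:x-amz-acl": "bucket-owner-full-control",
                    },
                },
            }
        ],
    }


# Placeholder names and account ID; substitute your own values.
policy = build_destination_policy("example-source-bucket", "example-inventory-reports", "111122223333")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the destination bucket's policy editor if the statement is not already present.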
Use Cases of Amazon S3 Inventory:
Audit and Report: Assists in auditing and reporting the storage by providing a list of objects, helping users to maintain compliance and understand their data landscape.
Data Lifecycle Management: Facilitates the management of object lifecycles by providing details about the objects, which can be used to apply appropriate lifecycle policies.
Optimizing Storage Costs: Helps in identifying older or infrequently accessed data that can be moved to more cost-effective storage classes like S3 Glacier.
Data Replication Checks: Useful for ensuring data replication is consistent across S3 buckets, particularly in different geographic locations.
Security and Access Control: Enables monitoring of object access control settings, helping to maintain security and privacy standards.
Big Data and Analytics Preparation: Simplifies the task of gathering metadata for objects stored in S3, making it easier to prepare for big data and analytics projects.
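As an illustration of the cost-optimization use case, the sketch below picks out objects that are still in the STANDARD storage class and have not been modified for a given number of days. It assumes the inventory rows have already been loaded as dictionaries with `Key`, `StorageClass`, and `LastModifiedDate` fields; the function name and sample rows are hypothetical. Note that the inventory records when an object was last modified, not when it was last read, so age is only a proxy for infrequent access.

```python
from datetime import datetime, timedelta, timezone


def find_archive_candidates(rows, min_age_days, now=None, storage_class="STANDARD"):
    """Return keys in `storage_class` whose last modification is older than `min_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    candidates = []
    for row in rows:
        # Inventory timestamps look like 2020-01-01T00:00:00.000Z; normalize the trailing Z.
        modified = datetime.fromisoformat(row["LastModifiedDate"].replace("Z", "+00:00"))
        if row["StorageClass"] == storage_class and modified < cutoff:
            candidates.append(row["Key"])
    return candidates


# Hypothetical sample rows for demonstration.
sample_rows = [
    {"Key": "logs/2020.log", "StorageClass": "STANDARD", "LastModifiedDate": "2020-01-01T00:00:00.000Z"},
    {"Key": "logs/new.log", "StorageClass": "STANDARD", "LastModifiedDate": "2024-05-30T00:00:00.000Z"},
    {"Key": "backups/old.tar", "StorageClass": "GLACIER", "LastModifiedDate": "2019-01-01T00:00:00.000Z"},
]
reference_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(find_archive_candidates(sample_rows, min_age_days=365, now=reference_time))
```

The resulting list could inform a lifecycle rule or a batch transition job; the sketch itself only reads the report and changes nothing in S3.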
Data Collected by Amazon S3 Inventory:
An inventory list file contains a list of the objects in the source bucket and metadata for each object. An inventory list file is stored in the destination bucket with one of the following formats:
As a CSV file compressed with GZIP
As an Apache optimized row columnar (ORC) file compressed with ZLIB
As an Apache Parquet file compressed with Snappy
Depending on the fields you select in the configuration, the metadata for each listed object includes:
Object Metadata: Includes the key name (path and filename), version ID (if versioning is enabled), and the last modified date of each object.
Storage Class Information: Specifies the storage class of each object, which is useful for understanding and managing storage costs.
Encryption Status: Indicates whether the object is encrypted and the type of encryption used.
Object Size: Provides the size of each object, helping in storage management and cost estimation.
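ETag: The entity tag of the object, a hash that reflects changes to the object's contents (not its metadata). Depending on how the object was created and encrypted, it may or may not be an MD5 digest of the data, so keep that caveat in mind when using it for integrity checks.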
Replication Status: Indicates if the object has been replicated to another bucket, particularly important for cross-region replication scenarios.
Delete marker – Set to True if the object is a delete marker. For more information, see Using versioning in S3 buckets. (This field is automatically added to your report if you've configured the report to include all versions of your objects.)
S3 Object Lock retain until date – The date until which the locked object cannot be deleted. For more information, see Using S3 Object Lock.
S3 Object Lock retention mode – Set to Governance or Compliance for objects that are locked. For more information, see Using S3 Object Lock.
S3 Object Lock legal hold status – Set to On if a legal hold has been applied to an object. Otherwise, it is set to Off. For more information, see Using S3 Object Lock.
Multipart upload flag – Set to True if the object was uploaded as a multipart upload. For more information, see Uploading and copying objects using multipart upload.
S3 Bucket Key status – Set to ENABLED or DISABLED. Indicates whether the object uses an S3 Bucket Key for SSE-KMS. For more information, see Using Amazon S3 Bucket Keys.
Checksum algorithm – Indicates the algorithm that's used to create the checksum for the object.
Object access control list – An access control list (ACL) for each object that defines which AWS accounts or groups are granted access to this object and the type of access that is granted. The Object ACL field is defined in JSON format. For more information, see Working with the Object ACL field and Access control list (ACL) overview.
Object owner – The owner of the object.
The inventory report can be generated daily or weekly and written in CSV, ORC (Optimized Row Columnar), or Parquet format. The report is stored in the destination bucket you specify, where AWS or third-party tools can analyze it further.
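CSV inventory files carry no header row; the column order is given by the `fileSchema` entry in the accompanying manifest.json, and AWS documents object keys in CSV reports as URL-encoded. The sketch below reads a GZIP-compressed CSV inventory file into dictionaries under those assumptions. The function name, the shortened schema, and the sample record are illustrative only, and decoding keys with `urllib.parse.unquote` is this sketch's choice rather than a documented requirement.

```python
import csv
import gzip
import io
from urllib.parse import unquote


def read_inventory_csv(gz_bytes: bytes, file_schema: str) -> list:
    """Decompress a CSV inventory file and label each column using the manifest's fileSchema."""
    columns = [name.strip() for name in file_schema.split(",")]
    text = gzip.decompress(gz_bytes).decode("utf-8")
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(columns, record))
        row["Key"] = unquote(row["Key"])  # keys are URL-encoded in CSV reports
        rows.append(row)
    return rows


# Hypothetical schema and one sample record, compressed in memory for the demo.
schema = "Bucket, Key, Size, LastModifiedDate, StorageClass"
sample = gzip.compress(
    b'"example-source-bucket","photos/cat%20pic.jpg","2048","2024-01-15T10:00:00.000Z","STANDARD"\n'
)
for row in read_inventory_csv(sample, schema):
    print(row["Key"], row["Size"], row["StorageClass"])  # photos/cat pic.jpg 2048 STANDARD
```

For very large reports, streaming the file with `gzip.open` instead of decompressing it in memory would be the more careful choice.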
S3 ListObjects API call vs Amazon S3 Inventory
Amazon S3 Inventory and the S3 ListObjects API call serve different purposes and their effectiveness depends on the specific use case. Here are some key points to consider when comparing the two:
Amazon S3 Inventory
Scheduled Reports: S3 Inventory provides a scheduled report of objects in a bucket or a prefix. These reports can be generated daily or weekly.
Comprehensive Metadata: It includes detailed metadata about each object, like size, last modified date, storage class, encryption status, and more.
Large Buckets: Ideal for buckets with a large number of objects, as it provides a more scalable solution to list and report on objects.
Analytics and Audit: Useful for data audits, analytics, and compliance reporting, as it provides a comprehensive and downloadable report.
Cost-Effectiveness: While it might introduce some additional costs for the storage of reports, it can be more cost-effective for large-scale data analysis compared to repeatedly calling the API.
Indirect Access: The results are not real-time but are instead provided based on the scheduled generation time.
S3 ListObjects API Call
Real-Time Data: Provides real-time listing of objects in a bucket. It's suitable for applications that need current information about stored objects.
Immediate Response: Useful for interactive applications where users or processes need immediate information about the objects in S3.
API Request Limits: There are limits on the number of items returned in a single call (up to 1,000), and pagination is needed for larger buckets.
Cost per Request: Can be more expensive for large buckets due to the cost associated with the large number of API calls required to list all objects.
Limited Metadata: Returns a fixed set of fields per object, such as key, size, last modified date, ETag, and storage class, but not the additional fields (for example encryption or replication status) that S3 Inventory reports can include.
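To make the pagination point concrete, here is a sketch of listing every key in a bucket by following continuation tokens, plus a rough estimate of how many requests a full listing takes at 1,000 keys per page. It assumes a client object exposing `list_objects_v2` with the request and response fields used by boto3 (`ContinuationToken`, `Contents`, `IsTruncated`, `NextContinuationToken`); the helper names are hypothetical.

```python
import math


def list_all_keys(client, bucket, prefix=""):
    """Collect every key under `prefix`, one page (up to 1,000 keys) per request."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # More results remain: resume from where this page ended.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def requests_needed(object_count, page_size=1000):
    """Estimate the number of list requests needed to enumerate `object_count` objects."""
    return max(1, math.ceil(object_count / page_size))


# With boto3 installed and credentials configured, usage might look like:
#   import boto3
#   keys = list_all_keys(boto3.client("s3"), "example-source-bucket")
print(requests_needed(2_500_000))  # 2500
```

A bucket with 2.5 million objects therefore needs about 2,500 sequential requests for one full listing, which is the situation where a scheduled inventory report tends to be the better fit.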
Which to Use When?
For Large Buckets and Comprehensive Analysis: S3 Inventory is better suited for large buckets where you need detailed reports on a regular basis for analysis, audit, or compliance purposes.
For Immediate Access and Interactive Applications: The ListObjects API is more appropriate for applications that require immediate access to object information, especially when dealing with a smaller number of objects.
In summary, S3 Inventory is generally a better option for comprehensive and periodic analysis of large buckets, while the ListObjects API is preferable for real-time, interactive access to bucket contents, especially when dealing with smaller datasets. The choice largely depends on the specific requirements of the task at hand.
Tutorial: Configuring Amazon S3 Inventory by using the S3 console
Testing the concept of Amazon S3 Inventory and gaining hands-on experience involves several steps, from setting up an S3 bucket to configuring the inventory and accessing the inventory report. Here's a step-by-step tutorial:
1. Set Up an AWS Account
- If you don’t already have an AWS account, create one at the AWS website.
2. Create an S3 Bucket
Log in to the AWS Management Console.
Navigate to the S3 service.
Click on “Create bucket”.
Provide a unique name for your bucket and select a region.
Leave the other settings at their defaults for this test, and click “Create”.
3. Upload Some Files to Your Bucket
Once the bucket is created, click on its name to open it.
Click on “Upload” and select a few files from your computer to upload.
After selecting files, click “Upload”.
4. Create a Second Bucket for Storing Inventory Reports
Repeat the bucket creation process to create a second bucket. This bucket will be used to store the inventory reports.
5. Enable S3 Inventory on the First Bucket
Go to the bucket where you uploaded files.
Click on the “Management” tab.
Find the “Inventory” section and click “Create inventory configuration”.
Fill out the inventory configuration:
Name: Give a descriptive name for the inventory.
Filter: You can specify a prefix to inventory only objects with that prefix, or leave it blank to inventory all objects. Also choose whether to include only the current versions of objects or all versions.
Destination: Specify the bucket you created for storing inventory reports. You may also need to specify a prefix within this bucket.
Output format: Choose CSV, ORC or Apache Parquet.
Frequency: Choose whether the report should be generated daily or weekly.
Additional fields: Select additional fields you want in the report, like Size, Last Modified Date, Storage Class, etc.
Click “Create”.
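The same settings can also be expressed programmatically. The sketch below assembles them into the dictionary shape that boto3's `put_bucket_inventory_configuration` accepts as its `InventoryConfiguration` argument. The helper `build_inventory_configuration`, its defaults, and all bucket names and IDs are assumptions made for illustration; the block only builds and prints the dictionary and does not call AWS.

```python
import json


def build_inventory_configuration(
    config_id,
    destination_bucket,
    account_id,
    *,
    prefix_filter=None,
    destination_prefix="inventory",
    output_format="CSV",
    frequency="Daily",
    all_versions=False,
    optional_fields=("Size", "LastModifiedDate", "StorageClass"),
):
    """Mirror the console form fields as an InventoryConfiguration dictionary."""
    if output_format not in {"CSV", "ORC", "Parquet"}:
        raise ValueError(f"unsupported output format: {output_format}")
    if frequency not in {"Daily", "Weekly"}:
        raise ValueError(f"unsupported frequency: {frequency}")
    config = {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "All" if all_versions else "Current",
        "Destination": {
            "S3BucketDestination": {
                "AccountId": account_id,
                "Bucket": f"arn:aws:s3:::{destination_bucket}",  # the destination is given as an ARN
                "Format": output_format,
                "Prefix": destination_prefix,
            }
        },
        "Schedule": {"Frequency": frequency},
        "OptionalFields": list(optional_fields),
    }
    if prefix_filter:
        # Omit Filter entirely to inventory every object in the bucket.
        config["Filter"] = {"Prefix": prefix_filter}
    return config


# Placeholder values; substitute your own bucket names and account ID.
config = build_inventory_configuration("daily-csv-report", "example-inventory-reports", "111122223333")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, applying it might look like:
#   import boto3
#   boto3.client("s3").put_bucket_inventory_configuration(
#       Bucket="example-source-bucket", Id=config["Id"], InventoryConfiguration=config)
```

Keeping the builder separate from the API call makes it easy to review or version-control the configuration before it is applied.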
When an inventory list is published, you can query the inventory list file with Amazon S3 Select. For more information about how to locate your inventory list and query the inventory list file with Amazon S3 Select, see Locating your inventory list.
6. Wait for the Inventory Report to be Generated
Amazon S3 Inventory reports are generated according to the specified schedule (daily or weekly). Depending on your choice, the first daily report can take up to 48 hours to arrive, and a weekly report up to a week.
7. Accessing the Inventory Report
Once the report is generated, go to the second bucket where the reports are being stored.
You will find the inventory report in the specified format. If it's in CSV format, download the GZIP-compressed file, decompress it, and open it with any spreadsheet software like Microsoft Excel or Google Sheets.
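Each delivered report comes with a manifest.json file that names the data files belonging to it, typically found under a path of the form destination-prefix/source-bucket/config-ID/YYYY-MM-DDTHH-MMZ/. The sketch below parses such a manifest and returns the report format, the column names, and the data file keys. The sample manifest is a shortened, made-up example using the field names AWS documents (`fileFormat`, `fileSchema`, `files`); the helper name is hypothetical.

```python
import json


def data_files_from_manifest(manifest_text: str):
    """Return (file format, column names or None, data file keys) from a manifest.json body."""
    manifest = json.loads(manifest_text)
    file_format = manifest["fileFormat"]
    columns = None
    if file_format == "CSV":
        # For CSV reports the schema is a comma-separated column list;
        # ORC and Parquet manifests describe the schema differently, so it is left unparsed here.
        columns = [name.strip() for name in manifest["fileSchema"].split(",")]
    keys = [entry["key"] for entry in manifest["files"]]
    return file_format, columns, keys


# Made-up manifest for demonstration; real manifests contain additional fields.
sample_manifest = json.dumps({
    "sourceBucket": "example-source-bucket",
    "destinationBucket": "arn:aws:s3:::example-inventory-reports",
    "fileFormat": "CSV",
    "fileSchema": "Bucket, Key, Size, LastModifiedDate, StorageClass",
    "files": [
        {"key": "inventory/example-source-bucket/daily-csv-report/data/example-file-1.csv.gz", "size": 1024},
    ],
})

file_format, columns, keys = data_files_from_manifest(sample_manifest)
print(file_format)
print(columns)
print(keys)
```

Reading the manifest first avoids guessing which data files belong to a given report, since files from several report runs can sit side by side in the destination bucket.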
8. Analyze the Inventory Data
Open the report to view details about the objects in your S3 bucket.
You can use this data for various purposes like auditing, cost analysis, or data lifecycle management.
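As one example of such analysis, the sketch below totals object counts and bytes per storage class, which gives a quick picture of where storage costs accumulate. It assumes the report rows are already loaded as dictionaries with `StorageClass` and `Size` fields (sizes as strings, as they appear in a CSV report); the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def summarize_by_storage_class(rows):
    """Return {storage class: {"objects": count, "bytes": total size}} for the given rows."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for row in rows:
        entry = totals[row["StorageClass"]]
        entry["objects"] += 1
        # Size can be empty for some entries (for example delete markers); treat it as zero.
        entry["bytes"] += int(row.get("Size") or 0)
    return dict(totals)


# Hypothetical sample rows for demonstration.
sample_rows = [
    {"Key": "a.jpg", "StorageClass": "STANDARD", "Size": "2048"},
    {"Key": "b.jpg", "StorageClass": "STANDARD", "Size": "1024"},
    {"Key": "c.tar", "StorageClass": "GLACIER", "Size": "500000"},
]
for storage_class, stats in summarize_by_storage_class(sample_rows).items():
    print(storage_class, stats["objects"], stats["bytes"])
```

For reports with millions of rows, the same aggregation is usually easier to run as a SQL query in a tool such as Athena.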
Additional Tips:
Remember that S3 Inventory can result in additional costs for the storage of the inventory reports and the operations performed.
You can automate the analysis of inventory reports using AWS Lambda functions or other AWS services like Athena for larger datasets.
Regularly review your S3 bucket and inventory configurations to optimize storage and costs.
By following these steps, you will have a practical understanding of how Amazon S3 Inventory works and how it can be utilized for managing objects in S3 buckets.