Architecture

Solution Overview

BucketArchiver’s Step Functions State Machine initiates an execution.
The State Machine provisions an ephemeral EC2 instance.
The EC2 instance reads data from the specified S3 input bucket.
The EC2 instance streams a compressed (gzip’ed tar) archive to the specified S3 output bucket.
Upon completion, the State Machine terminates the ephemeral EC2 instance.
The execution results are published to the BucketArchiver SNS topic.

Bucket Archive Files

BucketArchiver utilizes industry-standard, POSIX.1-1988 compliant tar and RFC 1952/RFC 1951 compatible parallel gzip implementations. The generated archives are multi-platform compatible and can be accessed on:

Any GNU/Linux distribution
macOS
Most UNIX-like Operating Systems
Windows (using native tooling or Windows Subsystem for Linux)
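
Because the archives are ordinary gzip-compressed tar files, they can be inspected and unpacked with the stock `tar` command. The following is a minimal sketch; the helper names `list_archive` and `extract_archive` and the file names are illustrative and not part of BucketArchiver.

```shell
# List the contents of an archive without extracting it.
list_archive() {
    tar -tzf "$1"
}

# Extract an archive ($1) into a target directory ($2), creating it if needed.
extract_archive() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example with a placeholder file name:
#   list_archive archive_2023-09-15T14-30-45Z.tar.gz
#   extract_archive archive_2023-09-15T14-30-45Z.tar.gz ./restored
```

The archive must first be downloaded from the output bucket, for example with `aws s3 cp`.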

CloudFormation Stack

BucketArchiver employs an AWS CloudFormation stack to manage its components. Interaction with BucketArchiver is performed through AWS CloudFormation interfaces (Console, CLI, API/SDK). Configurable parameters include:

S3 input bucket containing the data to archive
S3 output bucket for compressed archival storage
File and path patterns (e.g., ‘*.csv’) for inclusion in the archive
Desired EC2 instance type
Optional schedule for automated execution
Archive name
Maximum execution time (circuit breaker)
S3 and KMS access permissions

Modifications to the CloudFormation stack can be made at any time post-deployment to adjust these parameters.

State Machine

BucketArchiver orchestrates the archival workflow using an AWS Step Functions State Machine. Users can interact with the State Machine through AWS Step Functions interfaces (Console, CLI, API/SDK) to:

View execution history and statuses
Manually trigger archival executions

Ephemeral EC2 Instances

The State Machine provisions and terminates ephemeral EC2 instances as part of the archival workflow. These instances are optimized for archival tasks, leveraging parallel compression techniques on multi-core CPUs. The instances are based on Amazon Linux 2023.

An Amazon SNS topic is provisioned to publish BucketArchiver State Machine execution results, including execution statuses and metadata. The SNS topic also supports optional email notifications.

Networking

BucketArchiver requires a VPC for EC2 instance provisioning. Two VPC deployment options are available:

A user-supplied existing VPC
A BucketArchiver-created dedicated VPC

User-Supplied Existing VPC

The VPC must allow the EC2 instance to access public endpoints for the following AWS services:

AWS Step Functions
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
AWS Key Management Service (KMS), if the buckets use KMS keys
Amazon EC2
AWS IAM
Amazon CloudWatch

Dedicated VPC

Alternatively, a VPC tailored for BucketArchiver can be provisioned. This VPC features private subnets and AWS PrivateLink endpoints for secure, internal AWS service access. The VPC contains two subnets, and each subnet includes VPC endpoints for:

AWS Step Functions
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
AWS Key Management Service (KMS), if the buckets use KMS keys
Amazon EC2
AWS IAM
Amazon CloudWatch

Security

BucketArchiver is deployed within your AWS account, offering full control over data and compliance. IAM roles are provisioned for:

State Machine execution
EC2 instance S3 bucket access
Amazon EventBridge schedule triggers

Monitoring and Logging

Amazon CloudWatch is utilized for:

Logging State Machine executions and EC2 instance operations
Monitoring archival process metrics such as execution time and archive size

Limitations

BucketArchiver is subject to AWS service limits, including the 5 TB maximum object size for S3, which limits the resulting archive file to 5 TB.
A dedicated VPC can only access S3 buckets within its deployed AWS region.

Operations

CloudFormation Template Acquisition

You can obtain the BucketArchiver CloudFormation template from the AWS Marketplace. There are two template variants available:

Existing VPC Deployment: Use this template if you wish to deploy the BucketArchiver CloudFormation stack within your existing VPC.
Dedicated VPC Deployment: This template allows you to deploy the BucketArchiver stack along with a dedicated VPC.

CloudFormation Stack Deployment Steps

Navigate to the BucketArchiver product page on the AWS Marketplace.
Select and deploy your desired template variant.
Configure the CloudFormation template to create a stack.
Optionally re-adjust the deployment parameters such as input bucket, output bucket, instance type, archive name, scheduler settings, and more.

We provide a dedicated, interactive deployment guide here: [Deployment Guide](https://www.bucketarchiver.com/deployment-guide/).

The deployment usually completes within 3-5 minutes.

CloudFormation Parameters

Archive Parameters

Parameter Name	Type	Default Value	Min/Max	Constraints	Description and Examples
InputBucket	String	None	3-63	`[a-z0-9\-.]+`	Input S3 bucket name. Must have 3-63 characters; can contain lowercase letters, numbers, hyphens, and periods.
OutputBucket	String	None	3-63	`[a-z0-9\-.]+`	Output S3 bucket name. Similar constraints as `InputBucket`.
ArchivePattern	String	‘*’	N/A	Custom Pattern	Defines file or directory pattern for archiving. E.g., `'*'` for all files, `'*.txt'` for all text files.
ArchiveName	String	‘archive’	1-128	Alphanumeric, `-`, `_`, `.`	Specifies the archive name. Resulting file will be `<YourSpecifiedName>_<ISODate>.tar.gz`. E.g., ‘archive_2023-09-15T14-30-45Z.tar.gz’ if name is ‘archive’.
MaxExecutionTime	Number	3600	1-86400	N/A	Max time for archival process in seconds. Ranges from 1 second to 24 hours.
Scheduler	String	None	1-256	Specific Formats	Scheduling expression for triggering. E.g., specific time: `at(2023-09-15T14:30:45)`, daily at specific time: `cron(30 14 * * ? *)`, or every 5 days: `rate(5 days)`.
EmailAddress	String	''	N/A	Email Format	Optional. Receives notifications when archival completes. E.g., `user@example.com` or leave empty.
KMSKeysRestrictionList	CommaDelimitedList	‘*’	N/A	ARN Format or `*`	Comma-separated list of KMS key ARNs to restrict access to. Use `*` for no restrictions. E.g., `arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID,arn:aws:kms:REGION:ACCOUNT_ID:key/ANOTHER_KEY_ID`.
LogGroupRetentionDays	Number	7	N/A	1, 3	Number of days to retain log events in CloudWatch. Allowed values are 1 and 3.
InstanceType	String	‘t2.micro’	N/A	List of EC2 Types	EC2 instance type like `m5.large`, `t3.small`, etc.
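
The constraints in the table above can be checked locally before a stack is created or updated. The sketch below validates a bucket name against the documented pattern and length, and derives the resulting archive file name from the documented `<YourSpecifiedName>_<ISODate>.tar.gz` scheme. The function names are illustrative, and replacing colons with hyphens in the timestamp is inferred from the example in the table.

```shell
# Check a bucket name against the documented constraints:
# 3-63 characters from the set [a-z0-9-.].
is_valid_bucket_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9.-]{3,63}$'
}

# Build the archive file name <ArchiveName>_<ISODate>.tar.gz.
# $1: archive name, $2: ISO timestamp such as 2023-09-15T14:30:45Z
# Colons in the timestamp become hyphens, as in the documented example.
archive_file_name() {
    printf '%s_%s.tar.gz\n' "$1" "$(printf '%s' "$2" | tr ':' '-')"
}

# Example:
#   is_valid_bucket_name "my-input-bucket" && echo "bucket name ok"
#   archive_file_name archive 2023-09-15T14:30:45Z
```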

Dedicated VPC Network Configuration Parameters

Parameter Name	Type	Default Value	Min/Max	Constraints	Description and Examples
VpcCidrBlock	String	‘10.10.10.0/24’	N/A	CIDR Format	VPC CIDR block for EC2 instances, e.g., `10.0.0.0/24`.
SubnetACidrBlock	String	‘10.10.10.0/25’	N/A	CIDR Format	CIDR block for Public Subnet A, e.g., `10.0.0.0/26`.
SubnetBCidrBlock	String	‘10.10.10.128/25’	N/A	CIDR Format	CIDR block for Public Subnet B, e.g., `10.0.0.128/26`.
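
When overriding the default CIDR blocks, both subnet ranges need to fall inside the VPC range. The following sketch checks this with plain shell arithmetic; `ip_to_int` and `cidr_contains` are hypothetical helpers, and the check covers IPv4 containment only (it does not detect overlap between the two subnets).

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds if subnet CIDR $1 lies entirely inside VPC CIDR $2.
cidr_contains() {
    sub_ip=${1%/*}; sub_len=${1#*/}
    vpc_ip=${2%/*}; vpc_len=${2#*/}
    # A subnet cannot be larger than the VPC that contains it.
    [ "$sub_len" -ge "$vpc_len" ] || return 1
    mask=$(( (0xFFFFFFFF << (32 - $vpc_len)) & 0xFFFFFFFF ))
    # Both addresses must share the VPC's network prefix.
    [ $(( $(ip_to_int "$sub_ip") & $mask )) -eq $(( $(ip_to_int "$vpc_ip") & $mask )) ]
}

# Example with the default values:
#   cidr_contains 10.10.10.128/25 10.10.10.0/24 && echo "subnet B fits"
```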

(Re-)Configuring CloudFormation Stack(s)

Deployment and continuous configuration of BucketArchiver are executed through CloudFormation. After deploying the CloudFormation template with the initial settings, you can modify these parameters as needed.

To update a CloudFormation stack after parameter changes, you adjust the stack’s parameters using the AWS Management Console, AWS CLI, or SDKs and pass the revised values to CloudFormation. The service then compares the existing stack with the new parameters and adjusts the stack to match your updated settings.

Updating Configuration Parameters

Via AWS Console
- Go to the desired BucketArchiver stack in the CloudFormation console
- Select ‘Update Stack’
- Under the ‘Parameters’ section, locate the desired parameter and change/input your desired configuration
Via AWS CLI
When updating the stack via the AWS CLI, provide the parameter key and value using the --parameters option:

    aws cloudformation update-stack \
        --stack-name BucketArchiverStack-XXXYYY \
        --use-previous-template \
        --parameters ParameterKey=<KEY>,ParameterValue="<VALUE>"
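
When a stack is updated through the CLI, parameters that are not listed fall back to the template defaults, so it is usually safer to list every parameter and mark the unchanged ones with `UsePreviousValue=true`. The sketch below changes one parameter and keeps the rest. The function name `update_parameter` and the `AWS_CLI` variable are illustrative assumptions; setting `AWS_CLI=echo` prints the arguments instead of calling AWS.

```shell
AWS_CLI=${AWS_CLI:-aws}   # set AWS_CLI=echo for a dry run

# Update one parameter of a stack and keep all others at their current value.
# $1: stack name, $2: key to change, $3: new value, $4...: all parameter keys of the stack
update_parameter() {
    stack=$1; changed_key=$2; new_value=$3
    shift 3
    n=$#
    # Append one --parameters entry per key, then drop the original key list.
    for key in "$@"; do
        if [ "$key" = "$changed_key" ]; then
            set -- "$@" "ParameterKey=$key,ParameterValue=$new_value"
        else
            set -- "$@" "ParameterKey=$key,UsePreviousValue=true"
        fi
    done
    shift "$n"
    "$AWS_CLI" cloudformation update-stack \
        --stack-name "$stack" \
        --use-previous-template \
        --parameters "$@"
}

# Example (stack name is a placeholder, key list shortened):
#   update_parameter BucketArchiverStack-XXXYYY InstanceType c5.2xlarge \
#       InputBucket OutputBucket InstanceType
```

Values that contain commas, such as a list of KMS key ARNs, may need extra quoting in the CLI shorthand syntax or a JSON parameters file. Since the stack provisions IAM roles, CloudFormation may also ask for a `--capabilities` acknowledgement.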

Execution Modes

The Scheduler parameter in the BucketArchiver CloudFormation template determines when the archival process is triggered. Scheduling uses the cron, rate, and at expressions from Amazon EventBridge, giving you precise control over when the archival starts. Depending on the expression, you can achieve:

a one-time execution at a specific point in time
a rate of periodic execution within an interval of time (such as day/week/month)
a precise start time and periodic execution

Rate Expression

Format: rate(value unit)
Example: To execute the archival process daily, use: rate(1 day)

Cron Expression

Format: cron(minutes hours day-of-month month day-of-week year)
Example: To start the archival at 6 PM daily, use: cron(0 18 ? * * *)

At() Expression

The at() expression specifies a unique timestamp when an event should fire, ideal for one-off scheduled tasks.

Format: at(yyyy-mm-ddThh:mm:ss)
Example: To trigger an event for September 1, 2023, at 3:30:00 PM UTC, use: at(2023-09-01T15:30:00)

Using an at() expression yields a one-time execution at a point in time. If you specify a past point in time the execution will never be started automatically but can be triggered using an on-demand execution of the BucketArchiver AWS Step Functions state machine. Please note that selecting a past date in the at expression will immediately trigger an execution.

If you intend to use BucketArchiver for on-demand operation only (using the AWS console, Step Functions CLI/API, etc.), we advise using the at expression with a date in the far future, such as: at(2099-12-31T00:00:00)
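
A quick local check of the Scheduler value can catch malformed expressions before a stack update. The sketch below only verifies the overall shape of the three formats described above, not whether the individual fields are valid; `schedule_kind` is a hypothetical helper.

```shell
# Print "rate", "cron", or "at" for a Scheduler value; fail for anything else.
schedule_kind() {
    value=$1
    if printf '%s\n' "$value" | LC_ALL=C grep -Eq '^rate\([1-9][0-9]* (minute|minutes|hour|hours|day|days)\)$'; then
        echo rate
    elif printf '%s\n' "$value" | LC_ALL=C grep -Eq '^cron\(([^ ()]+ ){5}[^ ()]+\)$'; then
        # six space-separated fields: minutes hours day-of-month month day-of-week year
        echo cron
    elif printf '%s\n' "$value" | LC_ALL=C grep -Eq '^at\([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\)$'; then
        echo at
    else
        return 1
    fi
}

# Example:
#   schedule_kind 'cron(0 18 ? * * *)'
#   schedule_kind 'at(2099-12-31T00:00:00)'
```

Quoting the expression in single quotes keeps the shell from expanding `*` and `?`.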

On-demand Archive Workflow

The on-demand workflow allows you to manually trigger the BucketArchiver AWS Step Functions state machine as needed. Follow these steps to execute an on-demand archival using the AWS Console or the AWS CLI.

AWS Console

Open the AWS Management Console.
Navigate to AWS Step Functions.
Locate the BucketArchiver state machine and start a new execution.
Monitor the state machine’s execution and review the output in the designated output bucket.

AWS CLI

Using Bash

    REGION=$(aws configure get region)
    ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    aws stepfunctions start-execution --state-machine-arn "arn:aws:states:$REGION:$ACCOUNT_ID:stateMachine:BucketArchiverStateMachine-<XXXYYY>" --name "Execution-1" --input '{}'

Using PowerShell

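    # Resolve the region and account ID through the AWS CLI
    $REGION = aws configure get region
    $ACCOUNT_ID = aws sts get-caller-identity --query "Account" --output text
    # ${...} delimits the variable names so that the following ':' is not parsed as part of them
    aws stepfunctions start-execution --state-machine-arn "arn:aws:states:${REGION}:${ACCOUNT_ID}:stateMachine:BucketArchiverStateMachine-<XXXYYY>" --name "Execution-1" --input '{}'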
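
A fixed execution name such as "Execution-1" can be rejected when it is reused on the same state machine, so repeated on-demand runs are easier with a generated name. The sketch below also looks up the state machine ARN by name prefix instead of assembling it by hand. The function names and the `AWS_CLI` dry-run variable are illustrative assumptions, and the prefix `BucketArchiverStateMachine-` is taken from the commands above.

```shell
AWS_CLI=${AWS_CLI:-aws}   # set AWS_CLI=echo for a dry run

# Derive an execution name from the current UTC time.
execution_name() {
    printf 'Execution-%s\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Print the ARNs of all state machines whose name starts with prefix $1.
find_state_machine_arns() {
    "$AWS_CLI" stepfunctions list-state-machines \
        --query "stateMachines[?starts_with(name, '$1')].stateMachineArn" \
        --output text
}

# Start an execution of the state machine with ARN $1.
start_archival() {
    "$AWS_CLI" stepfunctions start-execution \
        --state-machine-arn "$1" \
        --name "$(execution_name)" \
        --input '{}'
}

# Example:
#   arn=$(find_state_machine_arns BucketArchiverStateMachine-)
#   start_archival "$arn"
```

If several BucketArchiver stacks are deployed in the same region, the lookup returns more than one ARN and the right one has to be picked manually.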

Monitoring and Reporting

BucketArchiver integrates with CloudWatch metrics to streamline operational monitoring. In CloudWatch, metrics are organized into namespaces, serving as containers for specific metric categories. The BucketArchiver namespace contains:

EC2 Instance Metrics: Capturing performance and resource usage of ephemeral EC2 instances.
Archive Process Metrics: Metrics directly related to the archive creation process.

EC2 Machine Metrics

Dimensions: InstanceId & InstanceType

CPU Metrics: Detailed insights into CPU performance with metrics like cpu_usage_idle, cpu_usage_iowait, and more.
Network Metrics (TCP/IP): Derived from netstat, these metrics (e.g., tcp_established, net_bytes_recv) offer a comprehensive view of network activity.
Ethernet/Interface Metrics: Sourced from ethtool, metrics such as rx_packets, tx_packets, and several “allowance exceeded” indicators shed light on network interface performance.

BucketArchiver Archive Generation Metrics

Dimensions: InputBucket (indicating the source S3 bucket) & RunTimestamp (marking the task’s conclusion timestamp).

ExecutionTime: Represents the duration required to process an input bucket.
ArchiveSize: Specifies the size (in bytes) of the produced archive.
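
The archive metrics can also be inspected from the command line. The sketch below lists the metrics recorded for one input bucket and converts an ArchiveSize value from bytes to GiB. It assumes the CloudWatch namespace is literally named `BucketArchiver`; the function names and the `AWS_CLI` dry-run variable are illustrative.

```shell
AWS_CLI=${AWS_CLI:-aws}   # set AWS_CLI=echo for a dry run

# List the metrics (including their RunTimestamp dimension values)
# recorded for input bucket $1.
list_archive_metrics() {
    "$AWS_CLI" cloudwatch list-metrics \
        --namespace BucketArchiver \
        --dimensions "Name=InputBucket,Value=$1"
}

# Convert a byte count, such as an ArchiveSize value, to GiB with two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

# Example:
#   list_archive_metrics my-input-bucket
#   bytes_to_gib 5368709120
```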

Logging

BucketArchiver logs important events and information to Amazon CloudWatch Logs. You can access the logs in the CloudWatch console under the /BucketArchiver-<CloudFormationStackId> log group. This log group contains the execution log of the AWS Step Functions state machine and the output of the compression process on the EC2 instance. The following log streams are created as part of the log group and can be used for troubleshooting purposes:

Log Stream Name	Use Case	Notes
`states/BucketArchiverStateMachine/<date>/<execution-id>`	Log events of BucketArchiver state machine
`invoke/<instance-id>/log`	Log events of BucketArchiver EC2 invocation process	Use this to understand which objects were included in the archival process.
`archive/<input-bucket-name>/<date>`	Log events of BucketArchiver archival and compression process	Use this to troubleshoot potential S3 bucket access permission issues.
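
For troubleshooting from a terminal, the log streams listed above can be read with `aws logs tail`, which is available in AWS CLI version 2. The sketch below tails the archival and compression streams of one input bucket; the function name, the `AWS_CLI` dry-run variable, and the example log group name are illustrative placeholders.

```shell
AWS_CLI=${AWS_CLI:-aws}   # set AWS_CLI=echo for a dry run

# Show recent archival log events for one input bucket.
# $1: log group name, $2: input bucket name, $3: look-back window (default 1h)
tail_archive_logs() {
    "$AWS_CLI" logs tail "$1" \
        --log-stream-name-prefix "archive/$2/" \
        --since "${3:-1h}"
}

# Example (replace the log group with the one of your stack):
#   tail_archive_logs "/BucketArchiver-<CloudFormationStackId>" my-input-bucket 3h
```

Changing the prefix to `invoke/` or `states/` selects the other two stream families.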

Performance

BucketArchiver internally uses a parallel compression tool capable of exploiting multiple CPU cores, providing more scalable and faster compression of your buckets than traditional approaches. Compression and archival performance depends on multiple factors, such as:

Instance Type
Bucket data profile (number of objects, size of objects, number of S3 prefixes, and the actual data in the objects)
Network latency between EC2 instance and S3 bucket (cross-region access introduces significant latency)

We advise the following procedure for optimizing efficiency:

Deploy BucketArchiver stack with a fairly small instance (such as a t3.medium)
Execute a run using AWS Step Functions state machine console or the CLI
Observe the CloudWatch metrics related to execution time, machine CPU utilization and networking.
Go to CloudFormation to adjust the instance type and select a different instance - for example a c5.2xlarge instance and execute the state machine workflow again.
Note the improvements in execution time (reduction) due to the larger instance size.

Picking Instance Types

When selecting an EC2 instance type for your BucketArchiver workflows, consider factors such as CPU, memory, and network performance. With the CloudFormation template, we have preselected a wide range of EC2 instance types that should provide optimum price/performance for BucketArchiver operation.

Cross Region Workflows

BucketArchiver supports archival workflows across different AWS regions. Keep in mind that cross-region data transfers may impact performance and incur additional data transfer costs. The dedicated VPC deployment variant is limited to accessing S3 buckets in the same region as the CloudFormation stack.

Patching and Updates

We maintain a heavily trimmed-down, Amazon Linux based AMI for BucketArchiver, and the instances deployed from this AMI are ephemeral. Even so, a patch update will occasionally be required. If there is a crucial update for the BucketArchiver AMI, primarily related to critical CVEs (Common Vulnerabilities and Exposures), AWS Marketplace will send out a notification. Once you are notified, we will provide a set of updated CloudFormation templates that use a patched AMI not affected by the CVE. Updating BucketArchiver involves updating the template of every BucketArchiver CloudFormation stack that you have deployed.

Changelog

Initial Release