Uploading to Cloud Storage

The upload command copies GeoParquet files to cloud object storage (S3, GCS, Azure) with parallel transfers and progress tracking.

Basic Usage

# Single file to S3
gpio upload input.parquet s3://bucket/path/output.parquet --profile my-profile

# Directory to S3
gpio upload data/ s3://bucket/dataset/ --profile my-profile

Supported Destinations

The storage provider is selected by the destination URL scheme:

  • AWS S3 - s3://bucket/path/
  • Google Cloud Storage - gs://bucket/path/
  • Azure Blob Storage - az://account/container/path/
  • HTTP stores - https://...

Authentication

AWS S3

Use AWS profiles configured in ~/.aws/credentials:

gpio upload data.parquet s3://bucket/file.parquet --profile my-profile

Profile credentials are automatically loaded from AWS CLI configuration.
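
A typical ~/.aws/credentials entry looks like this (the profile name and key values below are AWS's documented placeholders, not real credentials):

[my-profile]
# placeholder values; never commit real keys
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY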

Google Cloud Storage

Uses application default credentials. Set up with:

gcloud auth application-default login
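
Once credentials are in place, a GCS upload should need no additional flags (the bucket name below is a placeholder):

gpio upload data.parquet gs://my-bucket/file.parquet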

Azure Blob Storage

Uses Azure CLI credentials. Set up with:

az login
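
With an active CLI session, an Azure upload follows the az://account/container/path scheme (the account and container names below are placeholders):

gpio upload data.parquet az://myaccount/mycontainer/file.parquet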

Options

Pattern Filtering

Upload only specific file types:

# Only JSON files
gpio upload data/ s3://bucket/dataset/ --pattern "*.json"

# Only Parquet files
gpio upload data/ s3://bucket/dataset/ --pattern "*.parquet"

Parallel Uploads

Control concurrency for directory uploads:

# Upload 8 files in parallel (default: 4)
gpio upload data/ s3://bucket/dataset/ --max-files 8

Trade-off: higher parallelism makes uploads faster but uses more bandwidth and memory.

Chunk Concurrency

Control concurrent chunks within each file:

# More concurrent chunks per file (default: 12)
gpio upload large.parquet s3://bucket/file.parquet --chunk-concurrency 20
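
The two settings compose: --max-files controls how many files transfer at once, while --chunk-concurrency controls parallelism within each file. On a fast connection you might raise both (the values here are illustrative, not tuned recommendations):

# 8 files at a time, 20 chunks per file
gpio upload data/ s3://bucket/dataset/ --max-files 8 --chunk-concurrency 20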

Custom Chunk Size

Override default multipart upload chunk size:

# 10MB chunks instead of the default 5MB
gpio upload data.parquet s3://bucket/file.parquet --chunk-size 10485760
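
If you'd rather not work out byte counts by hand, the shell can do the arithmetic inline (assuming a POSIX-compatible shell):

# $((10 * 1024 * 1024)) expands to 10485760
gpio upload data.parquet s3://bucket/file.parquet --chunk-size $((10 * 1024 * 1024))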

Error Handling

By default, the command keeps uploading the remaining files if one of them fails. To stop on the first error instead:

# Stop immediately on first error
gpio upload data/ s3://bucket/dataset/ --fail-fast

Dry Run

Preview what would be uploaded without actually uploading:

gpio upload data/ s3://bucket/dataset/ --dry-run

The dry run shows:

  • Files that would be uploaded
  • Total size
  • Destination paths
  • AWS profile (if specified)

Directory Structure

When uploading directories, the structure is preserved:

# Input structure:
data/
├── region1/
│   ├── file1.parquet
│   └── file2.parquet
└── region2/
    └── file3.parquet

# After upload to s3://bucket/dataset/:
s3://bucket/dataset/region1/file1.parquet
s3://bucket/dataset/region1/file2.parquet
s3://bucket/dataset/region2/file3.parquet

See Also