Partitioning Files¶
The partition commands split GeoParquet files into separate files based on column values or spatial indices.
Smart Analysis: All partition commands automatically analyze your strategy before execution, providing statistics and recommendations.
By String Column¶
Partition by string column values or prefixes:
# Preview partitions
gpio partition string input.parquet --column region --preview
# Partition by full column values
gpio partition string input.parquet output/ --column category
# Partition by first 2 characters
gpio partition string input.parquet output/ --column mgrs_code --chars 2
# Hive-style partitioning
gpio partition string input.parquet output/ --column region --hive
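To get a feel for how many partitions a given --chars value would produce, the grouping can be approximated outside the tool. The following is a minimal sketch, not part of gpio: `prefix_counts` is a hypothetical helper, the sample codes are made up, and it assumes values are ASCII without whitespace.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command): reads one column value per line
# on stdin and prints "prefix<TAB>count" for each N-character prefix.
prefix_counts() {
    chars="$1"
    cut -c "1-${chars}" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up MGRS-like codes, grouped by their first 2 characters.
printf '%s\n' 10SEG 10SFH 11SKA 11SLT 11SMS | prefix_counts 2
```

Each output line corresponds to one partition that prefix-based partitioning would create, with the number of rows it would hold. The --preview flag remains the authoritative way to check the real result.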
By H3 Cells¶
Partition by H3 hexagonal cells:
# Preview at resolution 7 (~5km² cells)
gpio partition h3 input.parquet --resolution 7 --preview
# Partition at default resolution 9
gpio partition h3 input.parquet output/
# Keep H3 column in output files
gpio partition h3 input.parquet output/ --keep-h3-column
# Hive-style (H3 column included by default)
gpio partition h3 input.parquet output/ --resolution 8 --hive
Column behavior:
- Non-Hive: H3 column excluded by default (redundant with path)
- Hive: H3 column included by default
- Use --keep-h3-column to keep the column explicitly
If the H3 column doesn't exist in the input, it is added automatically.
By KD-Tree¶
Partition into balanced spatial partitions using a KD-tree:
# Auto-partition (default: ~120k rows each)
gpio partition kdtree input.parquet output/
# Preview auto-selected partitions
gpio partition kdtree input.parquet --preview
# Explicit partition count (must be power of 2)
gpio partition kdtree input.parquet output/ --partitions 32
# Exact computation (deterministic)
gpio partition kdtree input.parquet output/ --partitions 16 --exact
# Hive-style with progress tracking
gpio partition kdtree input.parquet output/ --hive --verbose
Column behavior:
- Same as H3: the KD-tree column is excluded by default and included for Hive-style output
- Use --keep-kdtree-column to keep the column explicitly
If the KD-tree column doesn't exist in the input, it is added automatically.
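Because --partitions must be a power of 2, a wrapper script may want to validate the value before calling gpio. A minimal sketch follows; `is_power_of_two` is a hypothetical helper, not something gpio provides, and whether gpio accepts a value of 1 is not covered here.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command): succeeds only for 1, 2, 4, 8, ...
is_power_of_two() {
    value="$1"
    # Reject empty input, non-digits, zero, and leading zeros.
    case "$value" in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    # Halve while even; a power of two ends at exactly 1.
    while [ $((value % 2)) -eq 0 ]; do
        value=$((value / 2))
    done
    [ "$value" -eq 1 ]
}

for count in 16 32 24; do
    if is_power_of_two "$count"; then
        echo "$count: ok"
    else
        echo "$count: not a power of 2"
    fi
done
```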
By Admin Boundaries¶
Split by administrative boundaries via a spatial join with remote datasets.
How It Works¶
This command performs two operations:
- Spatial Join: Queries remote admin boundaries using spatial extent filtering, then spatially joins them with your data
- Partition: Splits the enriched data by administrative levels
Quick Start¶
# Preview GAUL partitions by continent
gpio partition admin input.parquet --dataset gaul --levels continent --preview
# Partition by continent
gpio partition admin input.parquet output/ --dataset gaul --levels continent
# Hive-style partitioning
gpio partition admin input.parquet output/ --dataset gaul --levels continent --hive
Multi-Level Hierarchical Partitioning¶
Partition by multiple administrative levels:
# Hierarchical: continent → country
gpio partition admin input.parquet output/ --dataset gaul --levels continent,country
# All GAUL levels: continent → country → department
gpio partition admin input.parquet output/ --dataset gaul --levels continent,country,department
# Hive-style multi-level (creates e.g. continent=Africa/country=Ghana/department=Accra/)
gpio partition admin input.parquet output/ --dataset gaul \
--levels continent,country,department --hive
# Overture Maps by country and region
gpio partition admin input.parquet output/ --dataset overture --levels country,region
Datasets¶
Two remote admin boundary datasets are supported:
| Dataset | Standard | Columns Added | Description |
|---|---|---|---|
| gaul (default) | GAUL naming + ISO 3166-1 alpha-3 | admin:continent, admin:country, admin:department | FAO Global Administrative Unit Layers (GAUL) L2 - worldwide coverage with standardized naming |
| overture | Vecorel compliant (ISO 3166-1/2) | admin:country_code, admin:subdivision_code | Overture Maps Divisions with ISO 3166 codes (219 countries, 3,544 regions) - docs |
Vecorel Compliance (Overture Dataset Only)¶
The overture dataset follows the Vecorel administrative division extension specification with standardized ISO codes:
- admin:country_code (REQUIRED): ISO 3166-1 alpha-2 country code (e.g., "US", "AR", "DE")
- admin:subdivision_code: ISO 3166-2 subdivision code WITHOUT country prefix (e.g., "CA" not "US-CA")
The tool automatically transforms Overture's native region codes (e.g., "US-CA") to strip the country prefix for Vecorel compliance.
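The prefix stripping described above can be illustrated with plain shell parameter expansion. This is only a sketch of the transformation's effect, not the tool's actual implementation; `strip_country_prefix` is a hypothetical name.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command): drop everything up to and
# including the first hyphen, so "US-CA" becomes "CA". A code without a
# hyphen is returned unchanged.
strip_country_prefix() {
    code="$1"
    printf '%s\n' "${code#*-}"
}

strip_country_prefix "US-CA"   # prints CA
```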
Note: The GAUL dataset uses FAO's standardized naming system but is NOT Vecorel compliant:
- Has ISO 3166-1 alpha-3 codes (e.g., "TZA"), but Vecorel requires alpha-2 (e.g., "TZ")
- Uses GAUL's standardized naming for subnational units, not ISO 3166-2 codes
- Columns: admin:continent (continent name), admin:country (GAUL country name), admin:department (GAUL L2 name)
Notes¶
- Overture dataset: Vecorel compliant with ISO 3166-1 alpha-2 and ISO 3166-2 codes
- GAUL dataset: FAO standardized naming system - source.coop GAUL L2
- Performs spatial intersection to assign admin divisions based on geometry
- Requires internet connection to access remote datasets
- Uses spatial extent filtering and bbox columns for optimization
Common Options¶
All partition commands support:
# Compression settings
--compression [ZSTD|GZIP|BROTLI|LZ4|SNAPPY|UNCOMPRESSED]
--compression-level [1-22]
# Row group sizing
--row-group-size [exact row count]
--row-group-size-mb [target size like '256MB' or '1GB']
# Workflow options
--dry-run # Preview SQL without executing
--verbose # Detailed output
--preview # Preview results (partition commands)
--hive # Use Hive-style partitioning
--overwrite # Overwrite existing files
--preview-limit 15 # Number of partitions to show (default: 15)
--force # Override analysis warnings
--skip-analysis # Skip analysis (performance-sensitive cases)
--prefix PREFIX # Custom filename prefix (e.g., 'fields' → fields_USA.parquet)
Output Structures¶
Standard Partitioning¶
output/
├── partition_value_1.parquet
├── partition_value_2.parquet
└── partition_value_3.parquet
Hive-Style Partitioning¶
output/
├── column=value1/
│ └── data.parquet
├── column=value2/
│ └── data.parquet
└── column=value3/
└── data.parquet
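Hive-style paths encode the partition values in the directory names, so downstream scripts can recover them without opening the files. A minimal sketch, assuming the `column=value` directory layout shown above; `hive_pairs` is a hypothetical helper and the example path is illustrative.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command): print "key<TAB>value" for each
# key=value directory segment of a path. Segments without "=" are skipped.
hive_pairs() {
    printf '%s\n' "$1" | tr '/' '\n' | awk -F= '
        NF >= 2 {
            key = $1
            sub(/^[^=]*=/, "")   # strip "key=" and keep the rest as the value
            print key "\t" $0
        }'
}

hive_pairs "output/continent=Africa/country=Kenya/data.parquet"
```

The same function handles single-level and multi-level layouts, since each directory segment is parsed independently.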
Custom Filename Prefix¶
Add --prefix NAME to prepend a custom prefix to partition filenames:
# Standard: fields_USA.parquet, fields_Kenya.parquet
gpio partition admin input.parquet output/ --dataset gaul --levels country --prefix fields
# Hive: country=USA/fields_USA.parquet, country=Kenya/fields_Kenya.parquet
gpio partition admin input.parquet output/ --dataset gaul --levels country --prefix fields --hive
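The naming rules from the examples above can be summarized as a small function, which is handy when a script needs to predict where a single-level partition will land. This is a sketch inferred from the documented examples only; `partition_path` is a hypothetical helper, and multi-level layouts are not covered.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command).
# Usage: partition_path COLUMN VALUE [PREFIX] [standard|hive]
partition_path() {
    column="$1"
    value="$2"
    prefix="${3:-}"
    style="${4:-standard}"

    if [ -n "$prefix" ]; then
        file="${prefix}_${value}.parquet"   # e.g. fields_USA.parquet
    elif [ "$style" = "hive" ]; then
        file="data.parquet"                 # Hive default filename
    else
        file="${value}.parquet"             # standard default filename
    fi

    if [ "$style" = "hive" ]; then
        printf '%s\n' "${column}=${value}/${file}"
    else
        printf '%s\n' "$file"
    fi
}

partition_path country USA fields        # fields_USA.parquet
partition_path country USA fields hive   # country=USA/fields_USA.parquet
```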
Partition Analysis¶
Before creating files, analysis shows:
- Total partition count
- Rows per partition (min/max/avg/median)
- Distribution statistics
- Recommendations and warnings
Warnings trigger for:
- Very uneven distributions
- Too many small partitions
- Single-row partitions
Use --force to proceed despite warnings, or --skip-analysis to skip the analysis step in performance-sensitive cases.
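To illustrate the kind of summary the analysis reports, the sketch below computes a few of these statistics from a list of per-partition row counts. It is not the tool's implementation: `partition_stats` is a hypothetical helper, the row counts are made up, and the median is left out for brevity.

```bash
#!/bin/sh
# Hypothetical helper (not a gpio command): reads one row count per line on
# stdin and prints the partition count, min, max, mean, and the number of
# single-row partitions.
partition_stats() {
    awk '
        NR == 1 { min = $1; max = $1 }
        {
            total += $1
            if ($1 < min) min = $1
            if ($1 > max) max = $1
            if ($1 == 1) singles++
        }
        END {
            if (NR == 0) { print "no partitions"; exit }
            printf "partitions=%d min=%d max=%d avg=%.1f single_row=%d\n",
                   NR, min, max, total / NR, singles
        }'
}

# Made-up row counts for four partitions.
printf '%s\n' 1200 800 1 3999 | partition_stats
```

A large gap between min and max, or a non-zero single-row count, corresponds to the uneven-distribution and single-row warnings listed above.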
Preview Workflow¶
# 1. Preview to understand partitioning
gpio partition h3 large.parquet --resolution 7 --preview
# 2. Adjust resolution if needed
gpio partition h3 large.parquet --resolution 8 --preview
# 3. Execute when satisfied
gpio partition h3 large.parquet output/ --resolution 8
See Also¶
- CLI Reference: partition
- add command - Add spatial indices before partitioning