We need flexibility in the partitioning strategy (e.g., partitioning by user id, by another datetime field, etc.) to speed up queries that cannot take advantage of 1-hour partitioning.
There are several approaches to realizing this feature:
- Secondary indexes: These need to be kept synchronized with the master data. They are easy to use but hard to maintain, since keeping them in sync requires transactions.
- Materialized views: Take a snapshot of the data set and apply arbitrary partitioning to it to make queries faster. The snapshot will not be updated frequently.
- Database cracking (self-structure reorganization): A mixture of the above two approaches. Find the typical access patterns (column set, data range, etc.) from the query history, choose an optimal partitioning strategy, and re-create the mpc1 files to maximize Presto performance. An interesting aspect of this approach is that the reorganization can be performed while reading (querying) the data set.
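As a rough illustration of "finding the typical access patterns from query histories", the sketch below counts how often each column appears in query filters and proposes the most frequent non-time column as a partitioning candidate. The input format (a list of filter-column lists) and the function name `choose_partition_column` are hypothetical, and a plain frequency count is a simplification used here for illustration, not the strategy the system actually uses.

```python
from collections import Counter

def choose_partition_column(query_history, exclude=("time",)):
    """Return the column used most often in filters, or None if there is none.

    query_history: iterable of lists, one per query, naming the columns that
    query filtered on (hypothetical format for this sketch).
    exclude: columns that are already partitioned on, such as time.
    """
    counts = Counter(
        col
        for filter_columns in query_history
        for col in set(filter_columns)  # count each column once per query
        if col not in exclude
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

history = [
    ["time", "user_id"],
    ["user_id"],
    ["time", "country"],
    ["user_id", "country"],
]
candidate = choose_partition_column(history)
```

A real implementation would also need to weigh data ranges, column cardinality, and the cost of rewriting files, which this sketch ignores.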
We are planning to implement user-defined partitioning within Treasure Data. It will be released with one extra dimension of partitioning: users will be able to choose any column other than time and have the dataset additionally partitioned by hash buckets on that column.
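To make the idea of one extra, hash-bucketed partitioning dimension concrete, here is a minimal sketch. It assumes records are dictionaries with a `time` field in Unix seconds, and it uses CRC32 as the hash function. The names `bucket_of`, `partition_key`, `partition`, and `partitions_to_scan`, as well as the bucket count, are hypothetical and are not taken from the Treasure Data implementation.

```python
import zlib
from collections import defaultdict

HOUR = 3600  # seconds; corresponds to the 1-hour time partitioning

def bucket_of(value, num_buckets):
    """Map a column value to a stable hash bucket in [0, num_buckets)."""
    # crc32 is used because Python's built-in hash() of strings varies per process.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def partition_key(record, column, num_buckets):
    """Return (hour_start, bucket) for one record."""
    hour_start = record["time"] - record["time"] % HOUR
    return hour_start, bucket_of(record[column], num_buckets)

def partition(records, column, num_buckets):
    """Group records by (hour_start, bucket)."""
    parts = defaultdict(list)
    for record in records:
        parts[partition_key(record, column, num_buckets)].append(record)
    return dict(parts)

def partitions_to_scan(hour_starts, value, num_buckets):
    """Partitions that an equality filter on the chosen column must read."""
    bucket = bucket_of(value, num_buckets)
    return [(hour, bucket) for hour in hour_starts]

records = [
    {"time": 10, "user_id": "alice"},
    {"time": 20, "user_id": "bob"},
    {"time": 30, "user_id": "alice"},
    {"time": 3700, "user_id": "alice"},
]
parts = partition(records, "user_id", num_buckets=4)
to_scan = partitions_to_scan([0, 3600], "alice", num_buckets=4)
```

With this layout, a query filtering on a single `user_id` value reads one bucket per hour instead of every file in the hour, which is where the expected speedup comes from.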