Treasure Data's primary idea portal. 


Faster Querying by Partitioning on Multiple Columns

We need flexibility in the partitioning strategy (e.g., partitioning by user id, another datetime field, etc.) to speed up queries that cannot take advantage of 1-hour time partitioning.

Here are several approaches to realizing this feature:

 - Secondary indexes: These need to be kept synchronized with the master data. Easy to use, but hard to maintain since keeping them in sync requires transactions.

 - Materialized views: Take a snapshot of the data set and apply arbitrary partitioning to make queries faster. The snapshot will not be updated frequently.

 - Database cracking (self-structure reorganization): A mixture of the above two approaches. Find the typical access patterns (column set, data range, etc.) from query histories, choose an optimal partitioning strategy, and re-create mpc1 files to maximize Presto performance. An interesting aspect of this approach is that the reorganization can be performed while reading (querying) the data set.
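
The first step of the database cracking idea, deriving a partitioning column from query histories, could look roughly like the sketch below. It assumes the query history has already been reduced to a list of the columns each query filtered on; the function name, the input shape, and the sample data are all hypothetical and are not part of any Treasure Data or Presto API.

```python
from collections import Counter


def choose_partition_column(query_history, candidates):
    """Return the candidate column that appears in the most queries' filters.

    query_history: iterable of lists, one list of filter columns per query
                   (hypothetical, pre-extracted from query logs).
    candidates:    set of columns that are allowed as a partitioning key.
    Returns None when no candidate column was ever used in a filter.
    """
    counts = Counter(
        col
        for filter_columns in query_history
        for col in set(filter_columns)  # count each column once per query
        if col in candidates
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up history: each entry lists the columns one query filtered on.
history = [
    ["user_id", "time"],
    ["user_id"],
    ["country", "time"],
    ["user_id", "country"],
]
best = choose_partition_column(history, candidates={"user_id", "country"})
print(best)
```

A real implementation would probably also weigh data ranges and query cost, as the bullet above suggests; this sketch only counts how often each column is filtered on.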

  • Taro L. Saito
  • May 11, 2016
  • In development
  • Jun 19, 2017

    Admin Response

    We are planning to implement user-defined partitioning within Treasure Data. It will be released with one extra dimension of partitioning: users will be able to choose any column other than time and additionally partition the dataset on that column using hash buckets.
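
    As a rough illustration of what one extra dimension of hash-bucket partitioning could mean, the sketch below assigns each row to a partition identified by its 1-hour time window plus a hash bucket of a user-chosen column. The bucket count, the row layout (a dict with a Unix-seconds `time` field), and the function names are assumptions made for this example, not the actual Treasure Data implementation.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def partition_key(row, column, num_buckets=NUM_BUCKETS):
    """Return (hour_start, bucket) for a row.

    hour_start: the row's Unix time truncated to the start of its hour.
    bucket:     CRC32 of the column value modulo num_buckets. CRC32 is used
                because it is stable across processes, unlike Python's hash().
    """
    hour_start = row["time"] // 3600 * 3600
    bucket = zlib.crc32(str(row[column]).encode("utf-8")) % num_buckets
    return hour_start, bucket


def partition_rows(rows, column, num_buckets=NUM_BUCKETS):
    """Group rows by their (hour_start, bucket) partition key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row, column, num_buckets)].append(row)
    return dict(partitions)


rows = [
    {"time": 7200, "user_id": "alice"},
    {"time": 7300, "user_id": "alice"},
    {"time": 7400, "user_id": "bob"},
    {"time": 11000, "user_id": "alice"},
]
partitions = partition_rows(rows, "user_id")
```

    With a layout like this, a query that filters on a single `user_id` only has to read one bucket per hour instead of every row in that hour.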