Treasure Data's primary idea portal. 


Host the Running of Ad-Hoc Scripts within a Workflow

We've heard three primary use cases from customers for running scripts:

  1. Data engineers want to collect data from miscellaneous APIs by writing custom scripts and running them on a regular schedule. These are APIs TD is unlikely to support directly, but customers are happy to write and manage the scripts themselves.
  2. Users sometimes want to run a validation check directly against the primary source, e.g. to stop a workflow if the validation fails. In one example, a user wants to (a) pull data from Mixpanel, (b) hit an API to get a JQL-aggregated response, and (c) compare that count with the number of events collected into TD. The check matters because Mixpanel's architecture sometimes exports inconsistent data.
  3. Handling data processing that is not possible with SQL. Two examples: (1) a user has objects, each with a parent_id and a value, that they want to recursively crawl to return the data as a flattened hierarchy; (2) running R or Python scripts that retrain a machine-learning model on a regular basis.
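Use case 2 above amounts to a small comparison script. The sketch below is illustrative only: the function names, tolerance, and count inputs are assumptions, not an actual TD or Mixpanel API, and the "stop the workflow" signal is simply a non-zero exit.

```python
# Hypothetical sketch of use case 2: compare an aggregated count from the
# source API (e.g. a Mixpanel JQL response) with the count loaded into TD.
# All names and thresholds here are illustrative assumptions.

def validate_counts(source_count, td_count, tolerance=0.01):
    """Return True when the loaded count is within `tolerance` of the source count."""
    if source_count == 0:
        return td_count == 0
    return abs(source_count - td_count) / source_count <= tolerance

def check_or_fail(source_count, td_count):
    # Raising stops the step with a non-zero exit status, which is how a
    # script would signal "stop" to the surrounding workflow engine.
    if not validate_counts(source_count, td_count):
        raise SystemExit(f"count mismatch: source={source_count} td={td_count}")
```

In practice the two counts would come from an HTTP call to the source API and a query against TD; the comparison logic itself stays this simple.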

In each of these cases, customers would like Treasure Data to manage the resources the scripts run on.
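As an illustration of use case 3's first example, flattening a parent_id hierarchy is a short recursive-walk script. The record shape (`id`, `parent_id`, `value`) below is an assumption for illustration, not TD's schema.

```python
# Hypothetical sketch of use case 3(1): flatten a parent_id hierarchy so
# each record carries the full path of ancestor values, root first.
# The record shape is an assumed example, not an actual TD schema.

def flatten_hierarchy(records):
    """Return each record's id with the path of ancestor values, root first."""
    by_id = {r["id"]: r for r in records}

    def path(record):
        chain = []
        node = record
        while node is not None:
            chain.append(node["value"])
            parent = node.get("parent_id")
            node = by_id.get(parent) if parent is not None else None
        return list(reversed(chain))

    return [{"id": r["id"], "path": path(r)} for r in records]

rows = [
    {"id": 1, "parent_id": None, "value": "root"},
    {"id": 2, "parent_id": 1, "value": "child"},
    {"id": 3, "parent_id": 2, "value": "leaf"},
]
print(flatten_hierarchy(rows)[2]["path"])  # -> ['root', 'child', 'leaf']
```

This kind of self-join-to-arbitrary-depth traversal is awkward in SQL, which is why customers reach for a script.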

  • Rob Parrish
  • Jun 13 2017
  • In development
Product Component: Workflow Core
  • Jun 13, 2017

    Admin Response

    We are currently reviewing the feasibility of setting up such a resource-processing cluster. In the meantime, users can run and manage scripts from workflows by using the `emr>` operator.

    With this operator, users can either (a) have an EMR cluster of EC2 machines started before and terminated after the processing step, or (b) connect to an already-running EMR cluster.
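    A workflow step using the `emr>` operator might look like the sketch below. This is a hedged illustration only: the cluster settings, step definition, and S3 path are placeholder assumptions, and the exact field schema should be taken from the digdag `emr>` operator documentation.

    ```yaml
    # Hypothetical digdag workflow sketch (field values are illustrative).
    +run_script:
      emr>:
      # Option (b): reuse a running cluster by passing its cluster ID instead
      # of the map below. Option (a): let the operator create a cluster for
      # this step and terminate it afterwards:
      cluster:
        name: adhoc-script-cluster
        ec2:
          master_type: m3.xlarge
          core:
            type: m3.xlarge
            count: 2
      staging: s3://my-bucket/staging/
      steps:
        - type: command
          command: python my_script.py
    ```

    With option (a), the customer's script runs on resources that exist only for the duration of the step, which is close to the managed-resource experience this idea requests.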