14 Indexing Rewards Proportions

Now that we have a list of subgraphs we have synced, we want to figure out which ones offer the best indexing rewards returns. To do this, we want to pull in the number of allocated tokens on each subgraph deployment, as well as the curation signal. From there, we will be able to figure out which ones have the largest amount of signal relative to the allocation sizes, and maximize our returns.
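
As a toy illustration (made-up numbers, not real subgraph data): if two subgraphs carry the same 1,000 GRT of curation signal, but 100,000 GRT is allocated to the first and 500,000 GRT to the second, the first pays out more indexing rewards per allocated token:

# hypothetical signal and allocation sizes for two subgraphs, in GRT
signal = c(1000, 1000)
allocated = c(100000, 500000)
signal / allocated
## [1] 0.010 0.002

The first ratio is five times the second, so allocating to the first subgraph earns proportionally more rewards for the same stake.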

This is the base query we will use:

query = '
query MyQuery($created_at_gt: Int!) {
  subgraphDeployments(
    first: 1000
    orderBy: createdAt
    orderDirection: asc
    where: {activeSubgraphCount_gt: 0, deniedAt: 0, createdAt_gt: $created_at_gt}
  ) {
    ipfsHash
    indexerAllocations(where: {poi: null}) {
      allocatedTokens
      indexer {
        id
      }
    }
    signalledTokens
    createdAt
    deprecatedSubgraphCount
  }
}'

Next we initialize an empty dataset with tibble() and set created_at_gt to 0. The GraphQL endpoint returns at most 1,000 rows per request, so we need to extract the data in batches and assemble the complete dataset ourselves. Since more than 1,000 subgraphs will never be created in the exact same createdAt second, we can safely use that field as a pagination cursor.

At the start, the filter is set to 0, which returns the first 1,000 subgraphs by creation time. From then on, we take the maximum createdAt value from the previous batch and request only subgraphs created after that timestamp. This way we build the full dataset without pulling duplicate data.
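
To make the cursor update concrete, here is the shape of the variables across two consecutive requests (the timestamp is made up for illustration):

# request 1: start from the beginning of time
variables = list(created_at_gt = 0)
# request 2: suppose the newest createdAt in the first batch was 1650000000;
# filtering on it returns only subgraphs created strictly after that second
variables = list(created_at_gt = 1650000000)

The loop below automates exactly this, reading the new maximum out of each batch before requesting the next one.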

The implementation itself isn’t overly important; the key point is that in this step we pull the signal and allocation sizes for every subgraph that exists:

# attach the packages used below (no-ops if already loaded in an earlier chapter)
library(httr)      # POST() and content() for the API request
library(jsonlite)  # fromJSON() for parsing the response
library(tidyverse) # tibble, dplyr, and purrr for wrangling

# initialize empty dataset
rewards_info = tibble()
# start created_at_gt at 0; it is overwritten with each batch's maximum timestamp as data is pulled
variables = list(created_at_gt = 0)

# keep repeating the same query until we run out of rows to collect
# (url is the GraphQL endpoint defined in an earlier chapter)
repeat {
  # Send POST request to the GraphQL API
  response = POST(url, body = list(query = query, variables = variables), encode = "json")
  
  # Parse the JSON response
  content = content(response, "text", encoding = "UTF-8")
  json_data = fromJSON(content, flatten = TRUE)
  data = json_data$data$subgraphDeployments
  
  # if we are out of rows to collect, break the repetition and move on to the next step
  if (length(data) == 0) {
    break
  }
  
  # Process the data
  processed_data = as_tibble(data) %>%
    # calculate and clean up columns needed. Start by calculating count of indexers per subgraph
    mutate(unique_indexers = map_int(indexerAllocations, ~length(unique(.x$indexer.id))),
            # allocated tokens per subgraph
            total_allocated_tokens = map_dbl(indexerAllocations, ~sum(as.numeric(.x$allocatedTokens))) / 10^18,
            # signalled tokens per subgraph
            signalled_tokens = as.numeric(signalledTokens) / 10^18,
            # convert created at timestamp to datetime
            created_at = as.POSIXct(as.numeric(createdAt), origin = "1970-01-01"),
            # track whether the subgraph is deprecated
            deprecated_subgraph_count = deprecatedSubgraphCount) %>%
    # rename ipfsHash to deployment and select only the columns needed
    select(deployment = ipfsHash, unique_indexers, total_allocated_tokens, signalled_tokens, 
           created_at, deprecated_subgraph_count)
  
  # union the newly pulled data to the full dataset
  rewards_info = bind_rows(rewards_info, processed_data)
  
  # fewer than 1,000 rows in this batch means there is no more data to collect, so we can move on
  if (nrow(processed_data) < 1000) {
    break
  }
  # set the new maximum subgraph creation time to use for the next data pull 
  variables$created_at_gt = max(as.numeric(data$createdAt))
}
# Show and confirm full row count
paste("Fetched indexer counts for", nrow(rewards_info), "subgraphs")
## [1] "Fetched indexer counts for 11324 subgraphs"
# join rewards info to the list of synced subgraphs
synced_subgraphs = synced_subgraphs %>%
  left_join(rewards_info, by = 'deployment') %>% 
  select(deployment, unique_indexers, signalled_tokens, total_allocated_tokens) %>% 
  mutate(rewards_proportion = signalled_tokens / total_allocated_tokens)
## # A tibble: 258 × 5
##    deployment            unique_indexers signalled_tokens total_allocated_tokens
##    <chr>                           <int>            <dbl>                  <dbl>
##  1 QmTyyrcaTduZN1ySRwtV…               5            9902.               2369230 
##  2 QmWXhLkz6fRJwLyFmgBK…              13            6931.               1938503.
##  3 QmVUcp8WqE47payYxMJU…               9            4554.               1048922 
##  4 QmdKXcBUHR3UyURqVRQH…              25           33256.               8813353.
##  5 QmW26TG5s9myd1gzio9f…              15            4843.               1316928.
##  6 QmYSxAAWeXDJ8DbWTy3o…              11            2976.                707341 
##  7 QmYV4MbMHuE98rWL5HRJ…               6             495.                117677 
##  8 QmddvgpoNsfxXyQ972Kh…               8            1981.                470963 
##  9 QmSK5hvEsM5mw6658SE1…              10            4952.               1167837 
## 10 QmQuCY9bwTF13ZkPdpgy…              NA              NA                     NA 
## # ℹ 248 more rows
## # ℹ 1 more variable: rewards_proportion <dbl>
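
With the proportions computed, a natural follow-up is to rank the synced subgraphs by rewards_proportion to surface the deployments with the most signal relative to their allocations (a minimal sketch; the filter drops deployments with no rewards data, like the NA row above):

# rank by signal relative to allocated tokens, highest first
synced_subgraphs %>%
  filter(!is.na(rewards_proportion)) %>%
  arrange(desc(rewards_proportion)) %>%
  head(10)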

Nice! Next, we also want to consider how many queries are served on each subgraph before we allocate towards them, which we will do in the next section.

save.image('/root/github/indexer_analytics_tutorial/data/chapters_snapshots/06-indexing_rewards_proportions.RData')