Collectors
A collector allows users to define a hybrid collection of rules, based on:
- projects
- streams
- queries/filters.
At least one of these needs to be defined.
All results matching the collector's setup are buffered on the server for 7 days and consume 1 credit each, no matter how many times they are read. In other words, data can be downloaded multiple times at no additional cost.
A special use case of collectors is access to past data, which is presented separately in the next section.
In this section, we present how to download the results of a collector and provide a list of operations (including the definition of a collector) along with examples.
Downloading the results of a collector
The search results of a collector can be accessed with an HTTP GET request, which accepts several optional parameters (see Table Optional parameters).
curl -XGET 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>/results?access_token=<access_token>'
parameter | description | possible values | default value |
---|---|---|---|
resume_offset | Position to resume the data access from. Can be retrieved from control chunks. | "earliest", "latest" or <resume_token> | "earliest" |
end_behaviour | What to do when the most recent result has been reached. | "stop" or "wait" | "wait" |
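As a rough illustration of how the optional parameters combine with the results URL, the following Python sketch builds the request URL without sending it. The function name `build_results_url` and the validation of `end_behaviour` on the client side are assumptions made for this example; only the URL pattern and the parameter names come from the documentation above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.talkwalker.com/api/v3/stream/c"


def build_results_url(collector_id, access_token, resume_offset=None, end_behaviour=None):
    """Return the results URL, adding only the optional parameters that were given."""
    # Client-side check (an assumption of this sketch): only the documented values are accepted.
    if end_behaviour not in (None, "stop", "wait"):
        raise ValueError("end_behaviour must be 'stop' or 'wait'")
    params = {"access_token": access_token}
    if resume_offset is not None:
        # "earliest", "latest", or a resume token taken from a control chunk
        params["resume_offset"] = resume_offset
    if end_behaviour is not None:
        params["end_behaviour"] = end_behaviour
    return f"{BASE_URL}/{quote(collector_id, safe='')}/results?{urlencode(params)}"


print(build_results_url("collector-1", "<access_token>", resume_offset="latest", end_behaviour="stop"))
```

Omitting both optional arguments leaves the server defaults ("earliest" and "wait") in effect.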
Rate Limit
This endpoint is limited to 5 calls per minute.
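A client may want to stay below this limit on its own side instead of relying on rejected calls. The following is a minimal sketch of a sliding-window call budget; the class `CallBudget` and its methods are hypothetical helpers, not part of the Talkwalker API, and how the server measures the minute is an assumption.

```python
import time
from collections import deque


class CallBudget:
    """Client-side sliding window: at most max_calls within any period of `period` seconds."""

    def __init__(self, max_calls=5, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable so the logic can be tested without waiting
        self._calls = deque()       # timestamps of recent calls, oldest first

    def seconds_until_allowed(self):
        """Return 0.0 if a call may be made now, else the remaining wait in seconds."""
        now = self.clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record_call(self):
        """Register that a call has just been made."""
        self._calls.append(self.clock())
```

Before each request, the caller would sleep for `seconds_until_allowed()` seconds and then call `record_call()`. The same class can be instantiated with the limits of the other endpoints.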
Operations on the definition of a collector
The Talkwalker Streaming API allows users to create/replace, retrieve and delete a collector, using the endpoint https://api.talkwalker.com/api/v3/stream/c/<collector_id>.
Definition of a collector
The definition of a collector consists of the following parameters.
parameter | description |
---|---|
The collector_query must never include both project and project_topics; only one of the two is allowed.
To create an active collector, at least one parameter is required in the collector_query (e.g. project, stream or project_topics).
An empty collector query can be used to create a paused collector for past data export tasks. See the section Creation of export tasks in Talkwalker projects.
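The constraints above can be checked before a definition is sent. The sketch below is an illustrative client-side check only; the function `validate_collector_query` is hypothetical, and accepting both spellings `project` and `projects` is an assumption, made because both appear in this documentation.

```python
def validate_collector_query(collector_query, active=True):
    """Return a list of problems found in a collector_query dict (empty list: none found)."""
    problems = []
    # Both spellings are accepted here; which one the server expects is an assumption.
    has_project = "project" in collector_query or "projects" in collector_query
    if has_project and "project_topics" in collector_query:
        problems.append("project and project_topics must not be used together")
    # An active collector needs at least one non-empty parameter;
    # an empty query is only meaningful for a paused collector.
    if active and not any(collector_query.values()):
        problems.append("an active collector needs at least one parameter")
    return problems


print(validate_collector_query({"streams": ["stream-1"]}))
print(validate_collector_query({}))
```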
For all list parameters except filters (e.g. streams, queries, topics in project_topics), only one element needs to match (OR between the elements of the list: stream IDs, topic IDs, ...).
All filters need to be matched (AND between different filter IDs).
If multiple parameters are provided (e.g. project_topics and filters), they must all be matched (AND between different parameters).
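The combination rules can be written down as a small boolean function. This is only a model of the logic described above, since the actual matching happens on the server; the function `collector_matches` and the `hits` structure (the IDs a document is assumed to match, grouped by kind) are hypothetical.

```python
def collector_matches(collector_query, hits):
    """Evaluate the OR/AND rules for one document.

    hits maps "projects", "streams", "queries", "topics" and "filters"
    to the set of IDs that the document matches.
    """
    checks = []
    # Plain ID lists: one matching element is enough (OR).
    for name in ("projects", "streams"):
        if name in collector_query:
            checks.append(any(i in hits.get(name, set()) for i in collector_query[name]))
    # Queries are objects with an "id"; again one match is enough (OR).
    if "queries" in collector_query:
        checks.append(any(q["id"] in hits.get("queries", set()) for q in collector_query["queries"]))
    # Topics inside project_topics: one match is enough (OR).
    if "project_topics" in collector_query:
        topics = collector_query["project_topics"]["topics"]
        checks.append(any(t in hits.get("topics", set()) for t in topics))
    # Filters: every filter has to match (AND).
    if "filters" in collector_query:
        checks.append(all(f in hits.get("filters", set()) for f in collector_query["filters"]))
    # Different parameters are combined with AND; an empty query matches nothing.
    return bool(checks) and all(checks)
```

For a query with queries q1 and q2, topics t1 and t2, and filters f1 and f2, a document hitting q2, t1, f1 and f2 is collected, while one that misses f2 is not.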
Examples
Collector on a stream
{
  "stream_id": "stream-1",
  "rules": [
    {
      "rule_id": "<rule_id>",
      "query": "<query>"
    }
  ]
}
{
  "collector_query": {
    "streams": ["stream-1"]
  }
}
All documents which match the stream remain in the collector for 7 days.
{
  "collector_query": {
    "streams": ["stream-1"],
    "queries": [
      {
        "id": "<q1>",
        "query": "<query>"
      }
    ]
  }
}
This collector collects all documents that match "stream-1" AND q1.
{
  "collector_query": {
    "projects": ["<p1>", "<p2>"],
    "queries": [
      {
        "id": "<q1>",
        "query": "<query_1>"
      },
      {
        "id": "<q2>",
        "query": "<query_2>"
      }
    ]
  }
}
This collector collects all documents that match (p1 OR p2) AND (q1 OR q2).
{
  "collector_query": {
    "queries": [
      {
        "id": "<q1>",
        "query": "<query_1>"
      },
      {
        "id": "<q2>",
        "query": "<query_2>"
      }
    ],
    "project_topics": {
      "project": "<p1>",
      "topics": ["<t1>", "<t2>"]
    },
    "filters": ["<f1>", "<f2>"]
  }
}
This collector collects all documents that match (q1 OR q2) AND (t1 OR t2) AND f1 AND f2.
Create / update a collector
curl -XPUT 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>?access_token=<access_token>' \
  -d '<collector_definition>' \
  -H 'Content-Type: application/json; charset=UTF-8'
For a <collector_definition>, the field state should not be set (it is set to ACTIVE automatically), and at least a project, a stream or a query must be set in the field collector_query.
A collector can include only one project but multiple queries and streams. The number of allowed queries and streams is not limited.
curl -XPUT 'https://api.talkwalker.com/api/v3/stream/c/collector-1?access_token=<access_token>&pretty=true' \
  -d '{"collector_query" : {"streams" : ["stream-1"], "queries" : [{"id" : "q-1", "query" : "lang:en"}]}}' \
  -H 'Content-Type: application/json; charset=UTF-8'
{
  "status_code": "0",
  "status_message": "OK",
  "request": "PUT /api/v3/stream/c/collector-1?access_token=<access_token>&pretty=true",
  "result_stream": {
    "collectors": [
      {
        "state": "ACTIVE",
        "collector_id": "collector-1"
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 20 calls per minute.
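For readers who prefer a scripted client over curl, the create/update call can be prepared in Python roughly as follows. This is a sketch using only the standard library; the function name `build_put_collector_request` is hypothetical, and the request is built but not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_put_collector_request(collector_id, access_token, collector_query):
    """Build (but do not send) the PUT request that creates or replaces a collector."""
    query_string = urlencode({"access_token": access_token})
    url = (
        "https://api.talkwalker.com/api/v3/stream/c/"
        + quote(collector_id, safe="")
        + "?"
        + query_string
    )
    # The field "state" is deliberately left out, as described above.
    body = json.dumps({"collector_query": collector_query}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json; charset=UTF-8"},
    )


request = build_put_collector_request(
    "collector-1",
    "<access_token>",
    {"streams": ["stream-1"], "queries": [{"id": "q-1", "query": "lang:en"}]},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call.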
Retrieve the definition of a collector
curl -XGET 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>?access_token=<access_token>&pretty=true'
The response includes the state of the collector, which can take the following values: UNKNOWN, ACTIVE, ERROR, DELETED, PAUSED, NO_CREDITS.
{
  "status_code": "0",
  "status_message": "OK",
  "request": "GET /api/v3/stream/c/collector-1?access_token=<access_token>&pretty=true",
  "result_stream": {
    "collectors": [
      {
        "collector_id": "collector-1",
        "state": "ACTIVE",
        "query": {
          "streams": ["stream-1"],
          "queries": [
            {
              "id": "q-1",
              "query": "lang:en"
            }
          ]
        }
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 200 calls per minute.
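A response of this shape can be reduced to a mapping from collector ID to state. The sketch below assumes that any `status_code` other than "0" signals a failure, which is an inference from the examples shown here and not a documented rule; `collector_states` is a hypothetical helper.

```python
import json

# States listed in this section.
KNOWN_STATES = {"UNKNOWN", "ACTIVE", "ERROR", "DELETED", "PAUSED", "NO_CREDITS"}


def collector_states(response_text):
    """Map each collector_id in a response body to its state."""
    payload = json.loads(response_text)
    # Assumption: "0" means success, anything else is treated as an error.
    if payload.get("status_code") != "0":
        raise RuntimeError(f"request failed: {payload.get('status_message')}")
    states = {}
    for collector in payload.get("result_stream", {}).get("collectors", []):
        state = collector["state"]
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected collector state: {state}")
        states[collector["collector_id"]] = state
    return states


sample = '{"status_code": "0", "status_message": "OK", "result_stream": {"collectors": [{"collector_id": "collector-1", "state": "ACTIVE"}]}}'
print(collector_states(sample))
```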
Delete a collector
Deleting a collector permanently removes it and its content. A new collector with the same name can be created, but it will not include the old collector's results. In contrast, when a collector is updated with a new query without being deleted, the old data is still included.
curl -XDELETE 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>?access_token=<access_token>&pretty=true'
{
  "status_code": "0",
  "status_message": "OK",
  "request": "DELETE /api/v3/stream/c/collector-1?access_token=<access_token>&pretty=true",
  "result_stream": {
    "collectors": [
      {
        "collector_id": "collector-1",
        "state": "DELETED"
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 20 calls per minute.
Pause a collector
When this endpoint is called, the collector's state changes to "PAUSED". A collector does not collect any real-time data while it is paused. When a paused collector is resumed, all previously collected data is still included. A paused collector that is chosen as the target of an export task still receives all exported data.
curl -XPOST 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>/pause?access_token=<access_token>&pretty=true'
{
  "status_code": "0",
  "status_message": "OK",
  "request": "POST /api/v3/stream/c/collector-1/pause?access_token=<access_token>&pretty=true",
  "result_stream": {
    "collectors": [
      {
        "collector_id": "collector-1",
        "state": "PAUSED"
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 40 calls per minute.
Resume a collector
Resuming a collector shifts its state from "PAUSED" to "ACTIVE". All incoming data from the point of resuming the collector onwards is stored again.
curl -XPOST 'https://api.talkwalker.com/api/v3/stream/c/<collector_id>/resume?access_token=<access_token>&pretty=true'
{
  "status_code": "0",
  "status_message": "OK",
  "request": "POST /api/v3/stream/c/collector-1/resume?access_token=<access_token>&pretty=true",
  "result_stream": {
    "collectors": [
      {
        "collector_id": "collector-1",
        "state": "ACTIVE"
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 40 calls per minute.
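Pausing and resuming differ only in the last path segment, so a client can share one helper for both. The following sketch builds the POST request without sending it; `build_state_change_request` is a hypothetical name, and rejecting other actions on the client side is a choice made for this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_state_change_request(collector_id, action, access_token):
    """Build (but do not send) the POST request that pauses or resumes a collector."""
    if action not in ("pause", "resume"):
        raise ValueError("action must be 'pause' or 'resume'")
    query_string = urlencode({"access_token": access_token})
    collector_path = quote(collector_id, safe="")
    url = f"https://api.talkwalker.com/api/v3/stream/c/{collector_path}/{action}?{query_string}"
    return Request(url, method="POST")


for action in ("pause", "resume"):
    request = build_state_change_request("collector-1", action, "<access_token>")
    print(request.get_method(), request.full_url)
```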
Retrieve the information of all streams and collectors
curl -XGET 'https://api.talkwalker.com/api/v3/stream/info?access_token=<access_token>&pretty=true'
{
  "status_code": "0",
  "status_message": "OK",
  "request": "GET /api/v3/stream/info?access_token=<access_token>&pretty=true",
  "result_stream": {
    "streams": [
      {
        "stream_id": "stream-1",
        "enabled": true
      }
    ],
    "collectors": [
      {
        "collector_id": "collector-1",
        "state": "ACTIVE"
      }
    ]
  }
}
Rate Limit
This endpoint is limited to 20 calls per minute.
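The info response lends itself to a quick overview of which streams are enabled and which collectors are in which state. A minimal sketch, assuming the response has the shape shown in the example above; `summarize_info` is a hypothetical helper.

```python
import json


def summarize_info(response_text):
    """Return (enabled stream IDs, collector IDs grouped by state) from an info response."""
    result = json.loads(response_text).get("result_stream", {})
    enabled_streams = [
        stream["stream_id"]
        for stream in result.get("streams", [])
        if stream.get("enabled")
    ]
    collectors_by_state = {}
    for collector in result.get("collectors", []):
        collectors_by_state.setdefault(collector["state"], []).append(collector["collector_id"])
    return enabled_streams, collectors_by_state


sample = json.dumps({
    "status_code": "0",
    "result_stream": {
        "streams": [{"stream_id": "stream-1", "enabled": True}],
        "collectors": [{"collector_id": "collector-1", "state": "ACTIVE"}],
    },
})
print(summarize_info(sample))
```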