Check Command Reference

To give an overview of available commands, we divided them into several categories.

AppDynamics

Enables checks on AppDynamics Healthrule violations and, optionally, queries against the raw logs in the underlying Elasticsearch cluster.

appdynamics(url=None, username=None, password=None, es_url=None, index_prefix='')

Initialize AppDynamics wrapper.

Parameters:
  • url (str) – AppDynamics URL.
  • username (str) – AppDynamics username.
  • password (str) – AppDynamics password.
  • es_url (str) – AppDynamics Elasticsearch cluster URL.
  • index_prefix (str) – AppDynamics Elasticsearch cluster logs index prefix.

Note

If username and password are not supplied, then OAUTH2 will be used.

If appdynamics() is initialized with no args, then plugin configuration values will be used.

Methods of AppDynamics

healthrule_violations(application, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, severity=None)

Return Healthrule violations for AppDynamics application.

Parameters:
  • application (str) – Application name or ID
  • time_range_type (str) – Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW.
  • duration_in_mins (int) – Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types. Default is 5 mins.
  • start_time (int) – Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
  • end_time (int) – End time (in milliseconds) until which the metric data is returned. Default is now.
  • severity (str) – Filter results based on severity. Valid values are CRITICAL or WARNING.
Returns:

List of healthrule violations

Return type:

list

Example query:

appdynamics('https://appdynamics/controller/rest').healthrule_violations('49', time_range_type='BEFORE_NOW', duration_in_mins=5)

[
    {
        affectedEntityDefinition: {
            entityId: 408,
            entityType: "BUSINESS_TRANSACTION",
            name: "/error"
        },
        detectedTimeInMillis: 0,
        endTimeInMillis: 0,
        id: 39637,
        incidentStatus: "OPEN",
        name: "Backend errrors (percentage)",
        severity: "CRITICAL",
        startTimeInMillis: 1462244635000,
    }
]

metric_data(application, metric_path, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, rollup=True)

Query the AppDynamics metric-data API.

Parameters:
  • application (str) – Application name or ID
  • metric_path (str) – The path to the metric in the metric hierarchy
  • time_range_type (str) – Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW.
  • duration_in_mins (int) – Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types.
  • start_time (int) – Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
  • end_time (int) – End time (in milliseconds) until which the metric data is returned. Default is now.
  • rollup (bool) – By default, the values of the returned metrics are rolled up into a single data point (rollup=True). To get separate results for all values within the time range, set the rollup parameter to False.
Returns:

metric values for a metric

Return type:

list

query_logs(q='', body=None, size=100, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)

Perform search query on AppDynamics ES logs.

Parameters:
  • q (str) – Query string used in search.
  • body (dict) – Dict holding an ES query DSL.
  • size (int) – Number of hits to return. Default is 100.
  • source_type (str) – sourceType field filtering. Defaults to application-log and will be part of q.
  • duration_in_mins (int) – Duration in mins before current time. Default is 5 mins.
Returns:

ES query result hits.

Return type:

list

count_logs(q='', body=None, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)

Perform count query on AppDynamics ES logs.

Parameters:
  • q (str) – Query string used in search. Will be ignored if body is not None.
  • body (dict) – Dict holding an ES query DSL.
  • source_type (str) – sourceType field filtering. Defaults to application-log and will be part of q.
  • duration_in_mins (int) – Duration in mins before current time. Default is 5 mins. Will be ignored if body is not None.
Returns:

Query match count.

Return type:

int

Note

When passing an ES query DSL in body, all filter parameters must be added explicitly to the query body (e.g. eventTimestamp, application_id, sourceType).
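As a rough illustration of the note above, the following sketch builds a query body that carries its own filters. The helper name build_log_body, the bool/filter layout and the sample application_id are assumptions made for illustration; only the field names eventTimestamp, application_id and sourceType come from the note, and the mapping of your AppDynamics ES indices may differ.

```python
import time


def build_log_body(query_string, application_id, duration_in_mins=5,
                   source_type='application-log', now_ms=None):
    """Build an ES query DSL dict that states every filter explicitly."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    start_ms = now_ms - duration_in_mins * 60 * 1000
    return {
        'query': {
            'bool': {
                # Free-text part of the search.
                'must': [{'query_string': {'query': query_string}}],
                # Filters that q/duration_in_mins would otherwise add for you.
                'filter': [
                    {'term': {'application_id': application_id}},
                    {'term': {'sourceType': source_type}},
                    {'range': {'eventTimestamp': {'gte': start_ms, 'lte': now_ms}}},
                ],
            }
        }
    }


body = build_log_body('level:ERROR', application_id='49', now_ms=1462244635000)
# Inside a ZMON check this dict could then be passed as count_logs(body=body).
```

Keeping the time window in the body makes the check self-describing, at the cost of recomputing the timestamps on every run.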

Cassandra

Provides access to a Cassandra cluster via cassandra() wrapper object.

cassandra(node, keyspace, username=None, password=None, port=9042, connect_timeout=1, protocol_version=3)

Initialize cassandra wrapper.

Parameters:
  • node (str) – Cassandra host.
  • keyspace (str) – Cassandra keyspace used during the session.
  • username (str) – Username used in connection. It is recommended to use an unprivileged user for Cassandra checks.
  • password (str) – Password used in connection.
  • port (int) – Cassandra host port. Default is 9042.
  • connect_timeout (int) – Connection timeout.
  • protocol_version (int) – Protocol version used in connection. Default is 3.

Note

You should always use an unprivileged user to access your databases. Use plugin.cassandra.user and plugin.cassandra.pass to configure credentials for the zmon-worker.

execute(stmt)

Execute a CQL statement against the specified keyspace.

Parameters: stmt (str) – CQL statement
Returns: CQL result
Return type: list

CloudWatch

If running on AWS you can use cloudwatch() to access AWS metrics easily.

cloudwatch(region=None, assume_role_arn=None)

Initialize CloudWatch wrapper.

Parameters:
  • region (str) – AWS region for CloudWatch queries. Will be auto-detected if not supplied.
  • assume_role_arn (str) – AWS IAM role ARN to be assumed. This can be useful in cross-account CloudWatch queries.

Methods of Cloudwatch

query_one(dimensions, metric_name, statistics, namespace, period=60, minutes=5, start=None, end=None, extended_statistics=None)

Query a single AWS CloudWatch metric and return a single scalar value (float). Metric will be aggregated over the last five minutes using the provided aggregation type.

This method is a more low-level variant of the query method: all parameters, including all dimensions need to be known.

Parameters:
  • dimensions (dict) – Cloudwatch dimensions. Example {'LoadBalancerName': 'my-elb-name'}
  • metric_name (str) – Cloudwatch metric. Example 'Latency'.
  • statistics (list) – Cloudwatch metric statistics. Example 'Sum'
  • namespace (str) – Cloudwatch namespace. Example 'AWS/ELB'
  • period (int) – Cloudwatch statistics granularity in seconds. Default is 60.
  • minutes (int) – Used to determine start time of the Cloudwatch query. Default is 5. Ignored if start is supplied.
  • start (int) – Cloudwatch start timestamp. Default is None.
  • end (int) – Cloudwatch end timestamp. Default is None. If not supplied, then end time is now.
  • extended_statistics (list) – Cloudwatch ExtendedStatistics for percentiles query. Example ['p95', 'p99'].
Returns:

Return a float if single value, dict otherwise.

Return type:

float, dict

Example query with percentiles for AWS ALB:

cloudwatch().query_one({'LoadBalancer': 'app/my-alb/1234'}, 'TargetResponseTime', 'Average', 'AWS/ApplicationELB', extended_statistics=['p95', 'p99'])
{
    'TargetResponseTime': 0.224,
    'p95': 0.245,
    'p99': 0.300
}
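Because query_one() may return either a float or a dict, alert conditions sometimes need to handle both shapes. The sketch below shows one possible way to normalize the result; as_metric_dict is a hypothetical helper, not part of the wrapper, and the sample values are copied from the example above.

```python
def as_metric_dict(result, metric_name):
    """Return a dict regardless of whether result is a scalar or a dict."""
    if isinstance(result, dict):
        return result
    # A scalar result is filed under the metric name that was queried.
    return {metric_name: float(result)}


sample = {'TargetResponseTime': 0.224, 'p95': 0.245, 'p99': 0.300}
normalized = as_metric_dict(sample, 'TargetResponseTime')
scalar = as_metric_dict(0.224, 'TargetResponseTime')
```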

Note

In very rare cases, e.g. for ELB metrics, ZMON may show only one half, or one to two thirds, of the expected value because of a race with the data that is already present in CloudWatch at query time. To fix this, click “evaluate” on the alert; this triggers the check and moves its execution time to a new start time.

query(dimensions, metric_name, statistics='Sum', namespace=None, period=60, minutes=5)

Query AWS CloudWatch for metrics. Metrics will be aggregated over the last five minutes using the provided aggregation type (default “Sum”).

dimensions is a dictionary to filter the metrics to query. See the list_metrics boto documentation. You can provide the special value “NOT_SET” for a dimension to only query metrics where the given key is not set. This makes sense e.g. for ELB metrics as they are available both per AZ (“AvailabilityZone” has a value) and aggregated over all AZs (“AvailabilityZone” not set). Additionally you can include the special “*” character in a dimension value to do fuzzy (shell globbing) matching.

metric_name is the name of the metric to filter against (e.g. “RequestCount”).

namespace is an optional namespace filter (e.g. “AWS/EC2”).

To query an ELB for requests per second:

# both using special "NOT_SET" and "*" in dimensions here:
val = cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}, 'RequestCount', 'Sum')['RequestCount']
requests_per_second = val / 60
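The two special dimension values can be pictured with a small matcher. This is only an illustration of the semantics described above (“NOT_SET” means the key must be absent, “*” globs the value); it is not the wrapper's actual implementation, and the function name and sample metrics are hypothetical.

```python
from fnmatch import fnmatchcase

NOT_SET = 'NOT_SET'


def dimensions_match(metric_dimensions, wanted):
    """Check one metric's dimensions against a filter dict."""
    for key, pattern in wanted.items():
        if pattern == NOT_SET:
            # The dimension must not be present on the metric at all.
            if key in metric_dimensions:
                return False
        elif key not in metric_dimensions:
            return False
        elif not fnmatchcase(metric_dimensions[key], pattern):
            # Shell-style globbing, e.g. 'pierone-*'.
            return False
    return True


metrics = [
    {'LoadBalancerName': 'pierone-1'},
    {'LoadBalancerName': 'pierone-1', 'AvailabilityZone': 'eu-west-1a'},
    {'LoadBalancerName': 'other-elb'},
]
wanted = {'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}
selected = [m for m in metrics if dimensions_match(m, wanted)]
```

Only the first sample metric is selected: the second carries an AvailabilityZone and the third has a non-matching name.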

You can find existing metrics with the AWS CLI tools:

$ aws cloudwatch list-metrics --namespace "AWS/EC2"

Use the “dimensions” argument to select on what dimension(s) to aggregate over:

$ aws cloudwatch list-metrics --namespace "AWS/EC2" --dimensions Name=AutoScalingGroupName,Value=my-asg-FEYBCZF

The desired metric can now be queried in ZMON:

cloudwatch().query({'AutoScalingGroupName': 'my-asg-*'}, 'DiskReadBytes', 'Sum')

alarms(alarm_names=None, alarm_name_prefix=None, state_value=STATE_ALARM, action_prefix=None, max_records=50)

Retrieve cloudwatch alarms filtered by state value.

See describe_alarms boto documentation for more details.

Parameters:
  • alarm_names (list) – List of alarm names.
  • alarm_name_prefix (str) – Prefix of alarms. Cannot be specified if alarm_names is specified.
  • state_value (str) – State value used in alarm filtering. Available values are OK, ALARM (default) and INSUFFICIENT_DATA.
  • action_prefix (str) – Action name prefix. Example arn:aws:autoscaling: to filter results for all autoscaling related alarms.
  • max_records (int) – Maximum records to be returned. Default is 50.
Returns:

List of MetricAlarms.

Return type:

list

cloudwatch().alarms(state_value='ALARM')[0]
{
    'ActionsEnabled': True,
    'AlarmActions': ['arn:aws:autoscaling:...'],
    'AlarmArn': 'arn:aws:cloudwatch:...',
    'AlarmConfigurationUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 15, 707000, tzinfo=tzutc()),
    'AlarmDescription': 'Scale-down if CPU < 50% for 10.0 minutes (Average)',
    'AlarmName': 'metric-alarm-for-service-x',
    'ComparisonOperator': 'LessThanThreshold',
    'Dimensions': [
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'service-x-asg'
        }
    ],
    'EvaluationPeriods': 2,
    'InsufficientDataActions': [],
    'MetricName': 'CPUUtilization',
    'Namespace': 'AWS/EC2',
    'OKActions': [],
    'Period': 300,
    'StateReason': 'Threshold Crossed: 1 datapoint (36.1) was less than the threshold (50.0).',
    'StateReasonData': '{...}',
    'StateUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 16, 294000, tzinfo=tzutc()),
    'StateValue': 'ALARM',
    'Statistic': 'Average',
    'Threshold': 50.0
}

Counter

The counter() function allows you to get increment rates of increasing counter values. The main use case for counter() is to get rates per second of JMX counter beans (e.g. “Tomcat Requests”). The counter function requires one parameter, key, to identify the counter.

per_second(value)
counter('requests').per_second(get_total_requests())

Returns the value’s increment rate per second. Value must be a float or integer.

per_minute(value)
counter('requests').per_minute(get_total_requests())

Convenience method to return the value’s increment rate per minute (same as result of per_second() multiplied by 60).

Internally counter values and timestamps are stored in Redis.
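The rate calculation can be sketched as follows. This is a simplified model, not the worker's code: a plain dict stands in for Redis, the class name is hypothetical, and returning 0.0 for the first sample or after a counter reset is an assumption.

```python
import time


class CounterSketch(object):
    """Remember the last (value, timestamp) per key and derive rates."""

    def __init__(self, key, store):
        self.key = key
        self.store = store  # stand-in for Redis: key -> (value, timestamp)

    def per_second(self, value, now=None):
        now = time.time() if now is None else now
        last = self.store.get(self.key)
        self.store[self.key] = (value, now)
        if last is None:
            return 0.0  # assumption: no rate is known for the first sample
        last_value, last_ts = last
        elapsed = now - last_ts
        if elapsed <= 0 or value < last_value:
            return 0.0  # assumption: counter resets yield no rate
        return (value - last_value) / float(elapsed)

    def per_minute(self, value, now=None):
        # A per-minute rate is the per-second rate scaled by 60.
        return self.per_second(value, now) * 60.0


counter = CounterSketch('requests', {})
first = counter.per_second(1000, now=0)
rate = counter.per_second(1600, now=60)
per_minute_rate = counter.per_minute(2200, now=120)
```

Here 600 additional requests within 60 seconds give a rate of 10 per second, or 600 per minute.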

DNS

The dns() function provides a way to resolve hosts.

dns(host=None)

Methods of DNS

resolve(host=None)

Return the IP address of host. If host is None, the host given at initialization is resolved. If both are None, an exception is raised.

Returns: IP address
Return type: str

Example query:

dns('google.de').resolve()
'173.194.65.94'

dns().resolve('google.de')
'173.194.65.94'

EBS

Allows describing EBS objects (currently, only snapshots are supported).

ebs()

Methods of EBS

list_snapshots(account_id, max_items)

List the EBS Snapshots owned by the given account_id. By default, listing is possible for up to 1000 items, so we use pagination internally to overcome this.

Parameters:
  • account_id – AWS account id number (as a string). Defaults to the AWS account id where the check is running.
  • max_items – the maximum number of snapshots to list. Defaults to 100.
Returns:

an EBSSnapshotsList object

class EBSSnapshotsList
items()

Returns a list of dicts like

{
    "id": "snap-12345",
    "description": "Snapshot description...",
    "size": 123,
    "start_time": datetime.datetime(2017, 7, 16, 1, 1, 21, tzinfo=tzutc()),
    "state": "completed"
}

Example usage:

ebs().list_snapshots().items()

snapshots = ebs().list_snapshots(max_items=1000).items()  # for listing more than the default of 100 snapshots
start_time = snapshots[0]["start_time"].isoformat()  # returns a string that can be passed to time()
age = time() - time(start_time)
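Outside of a check, the same age calculation can be tried on sample data shaped like the items() result. The sketch below is illustrative only: the helper name and the second snapshot are made up, and the standard library's timezone.utc is used in place of tzutc().

```python
from datetime import datetime, timedelta, timezone


def newest_completed_age(snapshots, now):
    """Return the age of the most recent completed snapshot, or None."""
    completed = [s for s in snapshots if s['state'] == 'completed']
    if not completed:
        return None
    newest = max(s['start_time'] for s in completed)
    return now - newest


snapshots = [
    {'id': 'snap-12345', 'size': 123, 'state': 'completed',
     'start_time': datetime(2017, 7, 16, 1, 1, 21, tzinfo=timezone.utc)},
    {'id': 'snap-67890', 'size': 50, 'state': 'pending',
     'start_time': datetime(2017, 7, 17, 1, 0, 0, tzinfo=timezone.utc)},
]
age = newest_completed_age(
    snapshots, now=datetime(2017, 7, 17, 1, 1, 21, tzinfo=timezone.utc))
```

An alert could then fire when the age exceeds the expected backup interval.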

Elasticsearch

Provides search queries and health check against an Elasticsearch cluster.

elasticsearch(url=None, timeout=10, oauth2=False)

Note

If url is None, then the plugin will use the default Elasticsearch cluster set in worker configuration.

Methods of Elasticsearch

The search() method searches the ES cluster using URI or request body search. If body is None, a GET request is used.

Parameters:
  • indices (list) – List of indices to search. Limited to only 10 indices. [‘_all’] will search all available indices, which effectively leads to same results as None. Indices can accept wildcard form.
  • q (str) – Search query string. Will be ignored if body is not None.
  • body (dict) – Dict holding an ES query DSL.
  • source (bool) – Whether to include _source field in query response.
  • size (int) – Number of hits to return. Maximum value is 1000. Set to 0 if interested in hits count only.
Returns:

ES query result.

Return type:

dict

Example query:

elasticsearch('http://es-cluster').search(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500', size=0, source=False)

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [],
        "max_score": 0.0,
        "total": 1
    },
    "timed_out": false,
    "took": 2
}

count(indices=None, q='', body=None)

Return ES count of matching query.

Parameters:
  • indices (list) – List of indices to search. Limited to only 10 indices. [‘_all’] will search all available indices, which effectively leads to same results as None. Indices can accept wildcard form.
  • q (str) – Search query string. Will be ignored if body is not None.
  • body (dict) – Dict holding an ES query DSL.
Returns:

ES query result.

Return type:

dict

Example query:

elasticsearch('http://es-cluster').count(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500')

{
    "_shards": {
        "failed": 0,
        "successful": 16,
        "total": 16
    },
    "count": 12
}

health()

Return ES cluster health.

Returns: Cluster health result.
Return type: dict

Example query:

elasticsearch('http://es-cluster').health()

{
    "active_primary_shards": 11,
    "active_shards": 11,
    "active_shards_percent_as_number": 50.0,
    "cluster_name": "big-logs-cluster",
    "delayed_unassigned_shards": 0,
    "initializing_shards": 0,
    "number_of_data_nodes": 1,
    "number_of_in_flight_fetch": 0,
    "number_of_nodes": 1,
    "number_of_pending_tasks": 0,
    "relocating_shards": 0,
    "status": "yellow",
    "task_max_waiting_in_queue_millis": 0,
    "timed_out": false,
    "unassigned_shards": 11
}
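A check usually reduces the health dict to a few values an alert can compare. The sketch below shows one way this might look; the numeric severity mapping and the helper name are assumptions, while the keys are taken from the example result above.

```python
# Hypothetical mapping from cluster status to a numeric severity.
SEVERITY = {'green': 0, 'yellow': 1, 'red': 2}


def summarize_health(health):
    """Reduce a cluster health dict to alert-friendly values."""
    return {
        # Unknown or missing status is treated like 'red'.
        'severity': SEVERITY.get(health.get('status'), 2),
        'unassigned_shards': health.get('unassigned_shards', 0),
        'timed_out': bool(health.get('timed_out', False)),
    }


sample = {
    'status': 'yellow',
    'unassigned_shards': 11,
    'timed_out': False,
    'number_of_nodes': 1,
}
summary = summarize_health(sample)
```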

Entities

Provides access to ZMON entities.

entities(service_url, infrastructure_account, verify=True, oauth2=False)

Initialize entities wrapper.

Parameters:
  • service_url (str) – Entities service url.
  • infrastructure_account (str) – Infrastructure account used to filter entities.
  • verify – Verify SSL connection. Default is True.
  • oauth2 (bool) – Use OAUTH for authentication. Default is False.

Note

If service_url or infrastructure_account were not supplied, their corresponding values in worker plugin config will be used.

Methods of Entities

search_local(**kwargs)

Search entities in the local infrastructure account. If infrastructure_account is not supplied in kwargs, the wrapper's own infrastructure_account is used as the default filter, so the search returns entities “local” to your filtered entities.

Parameters: kwargs (str) – Filtering kwargs
Returns: Entities
Return type: list

Example searching all instance entities in local account:

entities().search_local(type='instance')

search_all(**kwargs)

Search all entities.

Parameters: kwargs (str) – Filtering kwargs
Returns: Entities
Return type: list

alert_coverage(**kwargs)

Return alert coverage for infrastructure_account.

Parameters: kwargs (str) – Filtering kwargs
Returns: Alert coverage result.
Return type: list

entities().alert_coverage(type='instance', infrastructure_account='1052643')

[
    {
        'alerts': [],
        'entities': [
            {'id': 'app-1-instance', 'type': 'instance'}
        ]
    }
]
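One use of this result is to list entities that no alert covers. The sketch below assumes the structure shown in the example (groups with alerts and entities lists); the helper name and the shape of the non-empty alerts entry are hypothetical.

```python
def uncovered_entities(coverage):
    """Collect ids of entities that appear in a group without alerts."""
    ids = []
    for group in coverage:
        if not group['alerts']:
            ids.extend(entity['id'] for entity in group['entities'])
    return ids


coverage = [
    {'alerts': [],
     'entities': [{'id': 'app-1-instance', 'type': 'instance'}]},
    # The content of an alert entry is made up for this sketch.
    {'alerts': [{'id': 42}],
     'entities': [{'id': 'app-2-instance', 'type': 'instance'}]},
]
missing = uncovered_entities(coverage)
```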

EventLog

The eventlog() function allows you to conveniently count EventLog events by type and time.

count(event_type_ids, time_from[, time_to=None][, group_by=None])

Return event counts for given parameters.

event_type_ids is either a single integer (use hex notation, e.g. 0x96001) or a list of integers.

time_from is a string time specification ('-5m' means 5 minutes ago, '-1h' means 1 hour ago).

time_to is a string time specification and defaults to now if not given.

group_by can specify an EventLog field name to group counts by.

eventlog().count(0x96001, time_from='-1m')                         # returns a single number
eventlog().count([0x96001, 0x63005], time_from='-1m')              # returns dict {'96001': 123, '63005': 456}
eventlog().count(0x96001, time_from='-1m', group_by='appDomainId') # returns dict {'1': 123, '5': 456, ..}

The count() method internally requests the EventLog Viewer’s “count” JSON endpoint.

History

Wrapper for KairosDB to access history data about checks.

history(url=None, check_id='', entities=None, oauth2=False)

Methods of History

result(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)

Return query result.

Parameters:
  • time_from – Relative time from in seconds. Default is ONE_WEEK_AND_5MIN.
  • time_to – Relative time to in seconds. Default is ONE_WEEK.
Returns:

Json result

Return type:

dict

get_one(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)

Return first result values.

Parameters:
  • time_from – Relative time from in seconds. Default is ONE_WEEK_AND_5MIN.
  • time_to – Relative time to in seconds. Default is ONE_WEEK.
Returns:

List of values

Return type:

list

get_aggregated(key, aggregator, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)

Return first result values. If no key filtering matches, empty list is returned.

Parameters:
  • key (str) – Tag key used in filtering the results.
  • aggregator (str) – Aggregator used in query. (e.g ‘avg’)
  • time_from – Relative time from in seconds. Default is ONE_WEEK_AND_5MIN.
  • time_to – Relative time to in seconds. Default is ONE_WEEK.
Returns:

List of values

Return type:

list

get_avg(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)

Return aggregated average.

Parameters:
  • key (str) – Tag key used in filtering the results.
  • time_from – Relative time from in seconds. Default is ONE_WEEK_AND_5MIN.
  • time_to – Relative time to in seconds. Default is ONE_WEEK.
Returns:

List of values

Return type:

list

get_std_dev(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)

Return aggregated standard deviation.

Parameters:
  • key (str) – Tag key used in filtering the results.
  • time_from – Relative time from in seconds. Default is ONE_WEEK_AND_5MIN.
  • time_to – Relative time to in seconds. Default is ONE_WEEK.
Returns:

List of values

Return type:

list
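A common pattern is to compare the current check value with the historical average and standard deviation. The sketch below assumes that get_avg() and get_std_dev() each return a list whose first element is the relevant aggregate; that assumption, the helper name and the factor of 3 are illustrative only.

```python
def deviates(current, avg_values, std_dev_values, factor=3.0):
    """Return True if current is more than factor std devs from the average."""
    if not avg_values or not std_dev_values:
        # No history available: do not flag anything.
        return False
    return abs(current - avg_values[0]) > factor * std_dev_values[0]


# Sample lists standing in for history().get_avg(key) and get_std_dev(key).
far_off = deviates(150, [100.0], [10.0])
close_by = deviates(120, [100.0], [10.0])
no_history = deviates(1, [], [])
```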

distance(weeks=4, snap_to_bin=True, bin_size='1h', dict_extractor_path='')

For detailed documentation of the distance function, see History distance functionality.

HTTP

Access to HTTP (and HTTPS) endpoints is provided by the http() function.

http(url[, method='GET'][, timeout=10][, max_retries=0][, verify=True][, oauth2=False][, allow_redirects=None][, headers=None])
Parameters:
  • url (str) – The URL that is to be queried. See below for details.
  • method (str) – The HTTP request method. Allowed values are GET or HEAD.
  • timeout (float) – The timeout for the HTTP request, in seconds. Defaults to 10.
  • max_retries (int) – The number of times the HTTP request should be retried if it fails. Defaults to 0.
  • verify (bool) – Can be set to False to disable SSL certificate verification.
  • oauth2 (bool) – Can be set to True to inject an OAuth 2 Bearer access token in the outgoing request.
  • oauth2_token_name (str) – The name of the OAuth 2 token. Default is uid.
  • allow_redirects (bool) – Follow request redirects. If None then it will be set to True in case of GET and False in case of HEAD request.
  • headers (dict) – The headers to be used in the HTTP request.
Returns:

An object encapsulating the response from the server. See below.

For checks on entities that define the attributes url or host, the given URL may be relative. In that case, the URL http://<value><url> is queried, where <value> is the value of that attribute, and <url> is the URL passed to this function. If an entity defines both url and host, the former is used.

This function cannot query URLs using a scheme other than HTTP or HTTPS; URLs that do not start with http:// or https:// are considered to be relative.
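The URL rules above can be summarized in a short sketch. It mirrors the description only (absolute URLs pass through, anything else is prefixed with http:// and the entity's url or host attribute); the function name, the sample entity and the error raised for entities without either attribute are assumptions.

```python
def resolve_url(url, entity):
    """Resolve the URL passed to http() against the checked entity."""
    if url.startswith('http://') or url.startswith('https://'):
        return url
    # Everything else is treated as relative; 'url' wins over 'host'.
    base = entity.get('url') or entity.get('host')
    if not base:
        raise ValueError('relative URL requires an entity with url or host')
    return 'http://' + base + url


entity = {'host': 'app01', 'url': 'app01:8080'}
absolute = resolve_url('https://www.example.org/data', entity)
relative = resolve_url('/heartbeat.jsp', entity)
host_only = resolve_url('/heartbeat.jsp', {'host': 'app01'})
```

Note that a URL with another scheme, such as ftp://, would also be treated as relative under these rules.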

Example:

http('http://www.example.org/data?fetch=json').json()

# avoid raising an error when the response has an error status (e.g. 500 or 503)
# but you are still interested in the response JSON
http('http://www.example.org/data?fetch=json').json(raise_error=False)

HTTP Responses

The object returned by the http() function provides methods: json(), text(), headers(), cookies(), content_size(), time() and code().

json(raise_error=True)

This method returns an object representing the content of the JSON response from the queried endpoint. Usually, this will be a map (represented by a Python dict), but, depending on the endpoint, it may also be a list, string, integer, floating-point number, or Boolean.

text(raise_error=True)

Returns the text response from queried endpoint:

http("/heartbeat.jsp", timeout=5).text().strip()=='OK: JVM is running'

Since we’re using a relative url, this check has to be defined for specific entities (e.g. type=zomcat will run it on all zomcat instances). The strip function removes all leading and trailing whitespace.

headers(raise_error=True)

Returns the response headers in a case-insensitive dict-like object:

http("/api/json", timeout=5).headers()['content-type']=='application/json'

cookies(raise_error=True)

Returns the response cookies in a dict-like object:

http("/heartbeat.jsp", timeout=5).cookies()['my_custom_cookie'] == 'custom_cookie_value'

content_size(raise_error=True)

Returns the length of the response content:

http("/heartbeat.jsp", timeout=5).content_size() > 1024

time(raise_error=True)

Returns the elapsed time in seconds until response was received:

http("/heartbeat.jsp", timeout=5).time() > 1.5

code()

Returns the HTTP status code from the queried endpoint:

http("/heartbeat.jsp", timeout=5).code()

actuator_metrics(prefix='zmon.response.', raise_error=True)

Parses the JSON result of a metrics endpoint into a map of the form endpoint -> method -> status -> metric:

http("/metrics", timeout=5).actuator_metrics()

prometheus()

Parses the text response according to the Prometheus specification using the prometheus_client library:

http("/metrics", timeout=5).prometheus()

JMX

To use JMXQuery, run “jmxquery” (this is not yet released)

Queries beans’ attributes on hosts specified in entities filter:

jmx().query('java.lang:type=Memory', 'HeapMemoryUsage', 'NonHeapMemoryUsage').results()

Another example:

jmx().query('java.lang:type=Threading', 'ThreadCount', 'DaemonThreadCount', 'PeakThreadCount').results()

This would return a dict like:

{
    "DaemonThreadCount": 524,
    "PeakThreadCount": 583,
    "ThreadCount": 575
}

KairosDB

Provides read access to the target KairosDB

kairosdb(url, oauth2=False)

Methods of KairosDB

query(name, group_by=None, tags=None, start=5, end=0, time_unit='minutes', aggregators=None, start_absolute=None, end_absolute=None)

Query kairosdb.

Parameters:
  • name (str) – Metric name.
  • group_by (list) – List of fields to group by.
  • tags (dict) – Filtering tags.
  • start (int) – Relative start time. Default is 5. Should be greater than or equal to 1.
  • end (int) – End time. Default is 0. If not 0, then it should be greater than or equal to 1.
  • time_unit (str) – Time unit (‘seconds’, ‘minutes’, ‘hours’). Default is ‘minutes’.
  • aggregators (list) – List of aggregators.
  • start_absolute (long) – Absolute start time in milliseconds, overrides the start parameter which is relative
  • end_absolute (long) – Absolute end time in milliseconds, overrides the end parameter which is relative
Returns:

Result queries.

Return type:

dict
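To make the parameters more concrete, the sketch below shows how they could map onto a KairosDB REST query payload. It is an assumption about the mapping, not the wrapper's source; the defaults follow the parameter descriptions above, and the function name, metric name and tag values are hypothetical.

```python
def build_query(name, group_by=None, tags=None, start=5, end=0,
                time_unit='minutes', aggregators=None,
                start_absolute=None, end_absolute=None):
    """Build a KairosDB query payload from wrapper-style arguments."""
    metric = {'name': name}
    if tags:
        metric['tags'] = tags
    if group_by:
        metric['group_by'] = [{'name': 'tag', 'tags': group_by}]
    if aggregators:
        metric['aggregators'] = aggregators

    payload = {'metrics': [metric]}
    # Absolute times override the relative ones.
    if start_absolute is not None:
        payload['start_absolute'] = start_absolute
    else:
        payload['start_relative'] = {'value': start, 'unit': time_unit}
    if end_absolute is not None:
        payload['end_absolute'] = end_absolute
    elif end:
        payload['end_relative'] = {'value': end, 'unit': time_unit}
    return payload


relative = build_query(
    'zmon.check.1',
    tags={'entity': ['host-1']},
    aggregators=[{'name': 'avg', 'sampling': {'value': 1, 'unit': 'minutes'}}],
)
absolute = build_query('zmon.check.1', start_absolute=1462244335000,
                       end_absolute=1462244635000)
```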

Kubernetes

Provides a wrapper for querying Kubernetes cluster resources.

kubernetes(namespace='default')

If namespace is None then all namespaces will be queried. This however will increase the number of calls to Kubernetes API server.

Note

  • Kubernetes wrapper will authenticate using service account, which assumes the worker is running in a Kubernetes cluster.
  • All Kubernetes wrapper calls are scoped to the Kubernetes cluster hosting the worker. It is not intended to be used in querying multiple clusters.

Label Selectors

Kubernetes API provides a way to filter resources using labelSelector. Kubernetes wrapper provides a friendly syntax for filtering.

The following examples show different usage of the Kubernetes wrapper utilizing label filtering:

# Get all pods with label application equal to zmon-worker
kubernetes().pods(application='zmon-worker')
kubernetes().pods(application__eq='zmon-worker')

# Get all pods with label application not equal to zmon-worker
kubernetes().pods(application__neq='zmon-worker')

# Get all pods with label application equal to any of zmon-worker or zmon-agent
kubernetes().pods(application__in=['zmon-worker', 'zmon-agent'])

# Get all pods with label application equal to none of zmon-worker or zmon-agent
kubernetes().pods(application__notin=['zmon-worker', 'zmon-agent'])
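For readers familiar with the Kubernetes API, the keyword syntax corresponds to labelSelector expressions. The sketch below shows one plausible translation; it is not the wrapper's implementation, and the function name is hypothetical.

```python
def label_selector(**kwargs):
    """Translate label filters such as app__in=[...] to a labelSelector."""
    parts = []
    for key in sorted(kwargs):
        value = kwargs[key]
        # 'application__neq' -> label 'application', operator 'neq'.
        label, _, operator = key.partition('__')
        if operator in ('', 'eq'):
            parts.append('{}={}'.format(label, value))
        elif operator == 'neq':
            parts.append('{}!={}'.format(label, value))
        elif operator in ('in', 'notin'):
            parts.append('{} {} ({})'.format(label, operator, ','.join(value)))
        else:
            raise ValueError('unsupported operator: {}'.format(operator))
    return ','.join(parts)


equal = label_selector(application='zmon-worker')
not_equal = label_selector(application__neq='zmon-worker')
any_of = label_selector(application__in=['zmon-worker', 'zmon-agent'])
none_of = label_selector(application__notin=['zmon-worker', 'zmon-agent'])
```

Several filters combine with a comma, which Kubernetes interprets as a logical AND.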

Methods of Kubernetes

pods(name=None, phase=None, ready=None, **kwargs)

Return list of Pods.

Parameters:
  • name (str) – Pod name.
  • phase (str) – Pod status phase. Valid values are: Pending, Running, Failed, Succeeded or Unknown.
  • ready (bool) – Pod readiness status. If None then all pods are returned.
  • **kwargs

    Pod Label Selectors filters.

Returns:

List of pods. Typical pod has “metadata”, “status” and “spec” fields.

Return type:

list
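To illustrate working with the returned list, the sketch below collects pods that are not ready. It assumes each pod dict follows the standard Kubernetes Pod schema, where status.conditions holds an entry of type Ready whose status is the string 'True'; the helper names and sample pods are hypothetical.

```python
def is_ready(pod):
    """Return True if the pod reports a Ready condition with status 'True'."""
    conditions = pod.get('status', {}).get('conditions') or []
    return any(c.get('type') == 'Ready' and c.get('status') == 'True'
               for c in conditions)


def not_ready_names(pods):
    """Return the names of all pods that are not ready."""
    return [p['metadata']['name'] for p in pods if not is_ready(p)]


pods = [
    {'metadata': {'name': 'zmon-worker-1'},
     'status': {'phase': 'Running',
                'conditions': [{'type': 'Ready', 'status': 'True'}]}},
    {'metadata': {'name': 'zmon-worker-2'},
     'status': {'phase': 'Pending',
                'conditions': [{'type': 'Ready', 'status': 'False'}]}},
    # A pod without any conditions yet counts as not ready.
    {'metadata': {'name': 'zmon-worker-3'}, 'status': {'phase': 'Pending'}},
]
pending = not_ready_names(pods)
```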

nodes(name=None, **kwargs)

Return list of Nodes. Namespace does not apply.

Parameters:
  • name (str) – Node name.
  • **kwargs

    Node Label Selectors filters.

Returns:

List of nodes. Typical node has “metadata”, “status” and “spec” fields.

Return type:

list

services(name=None, **kwargs)

Return list of Services.

Parameters:
  • name (str) – Service name.
  • **kwargs

    Service Label Selectors filters.

Returns:

List of services. Typical service has “metadata”, “status” and “spec” fields.

Return type:

list

endpoints(name=None, **kwargs)

Return list of Endpoints.

Parameters:
  • name (str) – Endpoint name.
  • **kwargs

    Endpoint Label Selectors filters.

Returns:

List of Endpoints. Typical Endpoint has “metadata”, and “subsets” fields.

Return type:

list

ingresses(name=None, **kwargs)

Return list of Ingresses.

Parameters:
  • name (str) – Ingress name.
  • **kwargs

    Ingress Label Selectors filters.

Returns:

List of Ingresses. Typical Ingress has “metadata”, “spec” and “status” fields.

Return type:

list

statefulsets(name=None, replicas=None, **kwargs)

Return list of Statefulsets.

Parameters:
  • name (str) – Statefulset name.
  • replicas (int) – Statefulset replicas.
  • **kwargs

    Statefulset Label Selectors filters.

Returns:

List of Statefulsets. Typical Statefulset has “metadata”, “status” and “spec” fields.

Return type:

list

daemonsets(name=None, **kwargs)

Return list of Daemonsets.

Parameters:
  • name (str) – Daemonset name.
  • **kwargs

    Daemonset Label Selectors filters.

Returns:

List of Daemonsets. Typical Daemonset has “metadata”, “status” and “spec” fields.

Return type:

list

replicasets(name=None, replicas=None, **kwargs)

Return list of ReplicaSets.

Parameters:
  • name (str) – ReplicaSet name.
  • replicas (int) – ReplicaSet replicas.
  • **kwargs

    ReplicaSet Label Selectors filters.

Returns:

List of ReplicaSets. Typical ReplicaSet has “metadata”, “status” and “spec” fields.

Return type:

list

deployments(name=None, replicas=None, ready=None, **kwargs)

Return list of Deployments.

Parameters:
  • name (str) – Deployment name.
  • replicas (int) – Deployment replicas.
  • ready (bool) – Deployment readiness status.
  • **kwargs

    Deployment Label Selectors filters.

Returns:

List of Deployments. Typical Deployment has “metadata”, “status” and “spec” fields.

Return type:

list

configmaps(name=None, **kwargs)

Return list of ConfigMaps.

Parameters:
  • name (str) – ConfigMap name.
  • **kwargs

    ConfigMap Label Selectors filters.

Returns:

List of ConfigMaps. Typical ConfigMap has “metadata” and “data”.

Return type:

list

persistentvolumeclaims(name=None, phase=None, **kwargs)

Return list of PersistentVolumeClaims.

Parameters:
  • name (str) – PersistentVolumeClaim name.
  • phase (str) – Volume phase.
  • **kwargs

    PersistentVolumeClaim Label Selectors filters.

Returns:

List of PersistentVolumeClaims. Typical PersistentVolumeClaim has “metadata”, “status” and “spec” fields.

Return type:

list

persistentvolumes(name=None, phase=None, **kwargs)

Return list of PersistentVolumes.

Parameters:
  • name (str) – PersistentVolume name.
  • phase (str) – Volume phase.
  • **kwargs

    PersistentVolume Label Selectors filters.

Returns:

List of PersistentVolumes. Typical PersistentVolume has “metadata”, “status” and “spec” fields.

Return type:

list

metrics()

Return API server metrics in prometheus format.

Returns: Cluster metrics.
Return type: dict

LDAP

Retrieve OpenLDAP statistics (needs “cn=Monitor” database installed in LDAP server).

ldap().statistics()

This would return a dict like:

{
    "connections_current": 77,
    "connections_per_sec": 27.86,
    "entries": 359369,
    "max_file_descriptors": 65536,
    "operations_add_per_sec": 0.0,
    "operations_bind_per_sec": 27.99,
    "operations_delete_per_sec": 0.0,
    "operations_extended_per_sec": 0.23,
    "operations_modify_per_sec": 0.09,
    "operations_search_per_sec": 24.09,
    "operations_unbind_per_sec": 27.82,
    "waiters_read": 76,
    "waiters_write": 0
}

All information is based on the cn=Monitor OpenLDAP tree. You can get more information in the OpenLDAP Administrator’s Guide. The meaning of the different fields is as follows:

connections_current
Number of currently established TCP connections.
connections_per_sec
Increase of connections per second.
entries
Number of LDAP records.
operations_*_per_sec
Number of operations per second per operation type (add, bind, search, ..).
waiters_read
Number of waiters for read (the OpenLDAP documentation does not explain this value further).

Memcached

Read-only access to memcached servers is provided by the memcached() function.

memcached([host=some.host][, port=11211])

Returns a connection to the Memcached server at <host>:<port>, where <host> is the value of the current entity’s host attribute, and <port> is the given port (default 11211). See below for a list of methods provided by the returned connection object.

Methods of the Memcached Connection

The object returned by the memcached() function provides the following methods:

get(key)

Returns the string stored at key. If key does not exist, an error is raised.

memcached().get("example_memcached_key")
json(key)

Returns the value stored at key, deserialized from JSON. For example, if you store a JSON object as the value of the key, you get a dict back.

memcached().json("example_memcached_key")
stats([extra_keys=[STR, STR]])

Returns a dict with general Memcached statistics such as memory usage and operations/s. All values are extracted using the Memcached STATS command.

The keys listed in extra_keys are also read from the Memcached server’s stats command and returned as well, e.g. version or uptime.

Example result:

{
    "incr_hits_per_sec": 0,
    "incr_misses_per_sec": 0,
    "touch_misses_per_sec": 0,
    "decr_misses_per_sec": 0,
    "touch_hits_per_sec": 0,
    "get_expired_per_sec": 0,
    "get_hits_per_sec": 100.01,
    "cmd_get_per_sec": 119.98,
    "cas_hits_per_sec": 0,
    "cas_badval_per_sec": 0,
    "delete_misses_per_sec": 0,
    "bytes_read_per_sec": 6571.76,
    "auth_errors_per_sec": 0,
    "cmd_set_per_sec": 19.97,
    "bytes_written_per_sec": 6309.17,
    "get_flushed_per_sec": 0,
    "delete_hits_per_sec": 0,
    "cmd_flush_per_sec": 0,
    "curr_items": 37217768,
    "decr_hits_per_sec": 0,
    "connections_per_sec": 0.02,
    "cas_misses_per_sec": 0,
    "cmd_touch_per_sec": 0,
    "bytes": 3902170728,
    "evictions_per_sec": 0,
    "auth_cmds_per_sec": 0,
    "get_misses_per_sec": 19.97
}
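
One common use of these statistics is a cache hit ratio. The following is a minimal sketch that assumes a dict shaped like the example result above; the helper name get_hit_ratio is hypothetical and not provided by ZMON.

```python
def get_hit_ratio(stats):
    """Fraction of GET commands answered from the cache, or None without GET traffic."""
    gets = stats["cmd_get_per_sec"]
    if not gets:
        return None
    return stats["get_hits_per_sec"] / float(gets)


# Values copied from the example result above.
sample = {
    "get_hits_per_sec": 100.01,
    "get_misses_per_sec": 19.97,
    "cmd_get_per_sec": 119.98,
}
ratio = get_hit_ratio(sample)
```

With the sample values the ratio is roughly 0.83, i.e. about five out of six GET commands were hits.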

MongoDB

Provides access to a MongoDB cluster.

mongodb(host, port=27017)

Methods of MongoDB

find(database, collection, query)

Nagios

This function provides a wrapper for Nagios plugins.

check_load()
nagios().nrpe('check_load')

Example check result as JSON:

{
    "load1": 2.86,
    "load15": 3.13,
    "load5": 3.23
}
check_list_timeout()
nagios().nrpe('check_list_timeout',  path="/data/production/", timeout=10)

This command will run “timeout 10 ls /data/production/” on the target host via nrpe.

Example check result as JSON:

{
    "exit": 0,
    "timeout": 0
}

The exit field is the exit code from NRPE: 0 for OK, 2 for ERROR. The timeout field should not be used yet.

check_diff_reverse()
nagios().nrpe('check_diff_reverse')

Example check result as JSON:

{
    "CommitLimit-Committed_AS": 16022524
}
check_mailq_postfix()
nagios().nrpe('check_mailq_postfix')

Example check result as JSON:

{
    "unsent": 0
}
check_memcachestatus()
nagios().nrpe('check_memcachestatus', port=11211)

Example check result as JSON:

{
    "curr_connections": 0.0,
    "cmd_get": 3569.09,
    "bytes_written": 66552.9,
    "get_hits": 1593.9,
    "cmd_set": 0.04,
    "curr_items": 0.0,
    "get_misses": 1975.19,
    "bytes_read": 83077.28
}
check_findfiles()

Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory and checks their access time, modification time and count.

nagios().nrpe('check_findfiles', directory='/data/example/error/', epoch=1)

Example check result as JSON:

{
    "ftotal": 0,
    "faccess": 0,
    "fmodify": 0
}
check_findolderfiles()

Find-file analyzer plugin for Nagios. This plugin counts the files within a directory that are older than each of two given ages in minutes.

nagios().nrpe('check_findolderfiles', directory='/data/stuff,/mnt/other', time01=480, time02=600)

Example check result as JSON:

{
    "total files": 831,
    "files older than time01": 112,
    "files older than time02": 0
}
check_findfiles_names()

Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory, optionally matching a filename pattern, and checks their access time, modification time and count.

nagios().nrpe('check_findfiles_names', directory='/mnt/storage/error/', epoch=1, name='app*')

Example check result as JSON:

{
    "ftotal": 0,
    "faccess": 0,
    "fmodify": 0
}
check_findfiles_names_exclude()

Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory, optionally excluding files that match a filename pattern, and checks their access time, modification time and count.

nagios().nrpe('check_findfiles_names_exclude', directory='/mnt/storage/error/', epoch=1, name='app*')

Example check result as JSON:

{
    "ftotal": 0,
    "faccess": 0,
    "fmodify": 0
}
check_logwatch()
nagios().nrpe('check_logwatch', logfile='/var/logs/example/p{}/catalina.out'.format(entity['instance']), pattern='Full.GC')

Example check result as JSON:

{
    "last": 0,
    "total": 0
}
check_ntp_time()
nagios().nrpe('check_ntp_time')

Example check result as JSON:

{
    "offset": 0.003063
}
check_iostat()
nagios().nrpe('check_iostat', disk='sda')

Example check result as JSON:

{
    "tps": 944.7,
    "iowrite": 6858.4,
    "ioread": 6268.4
}
check_hpacucli()
nagios().nrpe('check_hpacucli')

Example check result as JSON:

{
    "logicaldrive_1": "OK",
    "logicaldrive_2": "OK",
    "logicaldrive_3": "OK",
    "physicaldrive_2I:1:6": "OK",
    "physicaldrive_2I:1:5": "OK",
    "physicaldrive_1I:1:3": "OK",
    "physicaldrive_1I:1:2": "OK",
    "physicaldrive_1I:1:1": "OK",
    "physicaldrive_1I:1:4": "OK"
}
check_hpasm_fix_power_supply()
nagios().nrpe('check_hpasm_fix_power_supply')

Example check result as JSON:

{
    "status": "OK",
    "message": "System: 'proliant dl360 g6', S/N: 'CZJ947016M', ROM: 'P64 05/05/2011', hardware working fine, da: 3 logical drives, 6 physical drives cpu_0=ok cpu_1=ok ps_2=ok fan_1=46% fan_2=46% fan_3=46% fan_4=46% temp_1=21 temp_2=40 temp_3=40 temp_4=36 temp_5=35 temp_6=37 temp_7=32 temp_8=36 temp_9=32 temp_10=36 temp_11=32 temp_12=33 temp_13=48 temp_14=29 temp_15=32 temp_16=30 temp_17=29 temp_18=39 temp_19=37 temp_20=38 temp_21=45 temp_22=42 temp_23=39 temp_24=48 temp_25=35 temp_26=46 temp_27=35 temp_28=71 | fan_1=46%;0;0 fan_2=46%;0;0 fan_3=46%;0;0 fan_4=46%;0;0 'temp_1_ambient'=21;42;42 'temp_2_cpu#1'=40;82;82 'temp_3_cpu#2'=40;82;82 'temp_4_memory_bd'=36;87;87 'temp_5_memory_bd'=35;78;78 'temp_6_memory_bd'=37;87;87 'temp_7_memory_bd'=32;78;78 'temp_8_memory_bd'=36;87;87 'temp_9_memory_bd'=32;78;78 'temp_10_memory_bd'=36;87;87 'temp_11_memory_bd'=32;78;78 'temp_12_power_supply_bay'=33;59;59 'temp_13_power_supply_bay'=48;73;73 'temp_14_memory_bd'=29;60;60 'temp_15_processor_zone'=32;60;60 'temp_16_processor_zone'=3"
}
check_hpasm_gen8()
nagios().nrpe('check_hpasm_gen8')

Example check result as JSON:

{
    "status": "OK",
    "message": "ignoring 16 dimms with status 'n/a' , System: 'proliant dl360p gen8', S/N: 'CZJ2340R6C', ROM: 'P71 08/20/2012', hardware working fine, da: 1 logical drives, 4 physical drives"
}
check_openmanage()
nagios().nrpe('check_openmanage')

Example check result as JSON:

{
    "status": "OK",
    "message": "System: 'PowerEdge R720', SN: 'GN2J8X1', 256 GB ram (16 dimms), 5 logical drives, 10 physical drives|T0_System_Board_Inlet=21C;42;47 T1_System_Board_Exhaust=36C;70;75 T2_CPU1=59C;95;100 T3_CPU2=52C;95;100 W2_System_Board_Pwr_Consumption=168W;896;980 A0_PS1_Current_1=0.8A;0;0 A1_PS2_Current_2=0.2A;0;0 V25_PS1_Voltage_1=230V;0;0 V26_PS2_Voltage_2=232V;0;0 F0_System_Board_Fan1=1680rpm;0;0 F1_System_Board_Fan2=1800rpm;0;0 F2_System_Board_Fan3=1680rpm;0;0 F3_System_Board_Fan4=2280rpm;0;0 F4_System_Board_Fan5=2400rpm;0;0 F5_System_Board_Fan6=2400rpm;0;0"
}
check_ping()
nagios().local('check_ping')

Example check result as JSON:

{
    "rta": 1.899,
    "pl": 0.0
}
check_apachestatus_uri()
nagios().nrpe('check_apachestatus_uri', url='http://127.0.0.1/server-status?auto')

or

nagios().nrpe('check_apachestatus_uri', url='http://127.0.0.1:10083/server-status?auto')

Example check result as JSON:

{
    "idle": 60.0,
    "busy": 15.0,
    "hits": 24.256,
    "kBytes": 379.692
}
check_command_procs()
nagios().nrpe('check_command_procs', process='httpd')

Example check result as JSON:

{
    "procs": 33
}
check_http_expect_port_header()
nagios().nrpe('check_http_expect_port_header', ip='localhost', url='/', redirect='warning', size='9000:90000', expect='200', port='88', hostname='www.example.com')

Example check result as JSON:

{
    "size": 33633.0,
    "time": 0.080755
}

NOTE: If the NRPE check returns an ‘expect’ result (the return code is not the expected one), the check returns a NagiosError.

check_mysql_processes()
nagios().nrpe('check_mysql_processes', host='localhost', port='/var/lib/mysql/mysql.sock', user='myuser', password='mypas')

Example check result as JSON:

{
    "avg": 0,
    "threads": 1
}
check_mysqlperformance()
nagios().nrpe('check_mysqlperformance', host='localhost', port='/var/lib/mysql/mysql.sock', user='myuser', password='mypass')

Example check result as JSON:

{
    "Com_select": 15.27,
    "Table_locks_waited": 0.01,
    "Select_scan": 2.25,
    "Com_change_db": 0.0,
    "Com_insert": 382.26,
    "Com_replace": 8.09,
    "Com_update": 335.7,
    "Com_delete": 0.02,
    "Qcache_hits": 16.57,
    "Questions": 768.14,
    "Qcache_not_cached": 1.8,
    "Created_tmp_tables": 2.43,
    "Created_tmp_disk_tables": 2.25,
    "Aborted_clients": 0.3
}
check_mysql_slave()
nagios().nrpe('check_mysql_slave', host='localhost', port='/var/lib/mysql/mysql.sock', database='mydb', user='myusr', password='mypwd')

Example check result as JSON:

{
    "Uptime": 6215760.0,
    "Open tables": 3953.0,
    "Slave IO": "Yes",
    "Queries per second avg": 967.106,
    "Slow queries": 1047406.0,
    "Seconds Behind Master": 0.0,
    "Threads": 1262.0,
    "Questions": 6011300666.0,
    "Slave SQL": "Yes",
    "Flush tables": 1.0,
    "Opens": 59466.0
}
check_ssl_cert()
nagios().nrpe('check_ssl_cert', host_ip='91.240.34.73', domain_name='www.example.com')

or

nagios().local('check_ssl_cert', host_ip='91.240.34.73', domain_name='www.example.com')

Example check result as JSON:

{
    "days": 506
}

NRPE checks for Windows Hosts

Checks are based on nsclient++ v.0.4.1. For more information, see http://docs.nsclient.org/

CheckCounter()

Returns performance counters for a process (usedMemory/WorkingSet).

nagios().win('CheckCounter', process='eo_server')

Example check result as JSON:

used memory in bytes

{
    "ProcUsedMem": 811024384
}
CheckCPU()
nagios().win('CheckCPU')

Example check result as JSON:

{
    "1": 4,
    "10": 8,
    "5": 14
}
CheckDriveSize()
nagios().win('CheckDriveSize')

Example check result as JSON:

Used space in MByte

{
    "C:\\ %": 61.0,
    "C:\\": 63328.469
}
CheckEventLog()
nagios().win('CheckEventLog', log='application', query='generated gt -7d AND type=\'error\'')

‘generated gt -7d’ means in the last 7 days

Example check result as JSON:

{
    "eventlog": 20
}
CheckFiles()
nagios().win('CheckFiles', path='C:\\Import\\Exchange2Clearing', pattern='*.*', query='creation lt -1h')

‘creation lt -1h’ means older than 1 hour

Example check result as JSON:

{
    "found files": 22
}
CheckLogFile()
nagios().win('CheckLogFile', logfile='c:\\Temp\\log\\maxflow_portal.log', seperator=' ', query='column4 = \'ERROR\' OR column4 = \'FATAL\'')

Example check result as JSON:

{
    "count": 4
}
CheckMEM()
nagios().win('CheckMEM')

Example check result as JSON:

used memory in MBytes

{
    "page file %": 16.0,
    "page file": 5534.105,
    "physical memory": 3331.109,
    "virtual memory": 268.777,
    "virtual memory %": 0.0,
    "physical memory %": 20.0
}
CheckProcState()
nagios().win('CheckProcState', process='check_mk_agent.exe')

Example check result as JSON:

{
    "status": "OK",
    "message": "check_mk_agent.exe: running"
}
CheckServiceState()
nagios().win('CheckServiceState', service='ENAIO_server')

Example check result as JSON:

{
    "status": "OK",
    "message": "ENAIO_server: started"
}
CheckUpTime()
nagios().win('CheckUpTime')

Example check result as JSON:

uptime in ms

{
    "uptime": 412488000
}

Ping

Simple ICMP ping function which returns True if the ping command returned without error and False otherwise.

ping(timeout=1)
ping()

The timeout argument specifies the timeout in seconds. Internally it just runs the following system command:

ping -c 1 -w <TIMEOUT> <HOST>
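
The behaviour described above can be sketched in plain Python as follows. This is only an illustration of the documented command line, not ZMON's implementation; the helper names build_ping_command and ping_sketch are hypothetical.

```python
import subprocess


def build_ping_command(host, timeout=1):
    """Build the argument list for the system command shown above."""
    return ["ping", "-c", "1", "-w", str(timeout), host]


def ping_sketch(host, timeout=1):
    """Return True if the ping command exits with status 0, False otherwise."""
    return subprocess.call(build_ping_command(host, timeout)) == 0


command = build_ping_command("example.org", timeout=2)
```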

Redis

Read-only access to Redis servers is provided by the redis() function.

redis([port=6379][, db=0])

Returns a connection to the Redis server at <host>:<port>, where <host> is the value of the current entity’s host attribute, and <port> is the given port (default 6379). See below for a list of methods provided by the returned connection object.

Please also have a look at the Redis documentation.

Methods of the Redis Connection

The object returned by the redis() function provides the following methods:

llen(key)

Returns the length of the list stored at key. If key does not exist, its value is treated as if it were an empty list, and 0 is returned. If key exists but is not a list, an error is raised.

redis().llen("prod_eventlog_queue")
lrange(key, start, stop)

Returns the elements of the list stored at key in the range [start, stop]. If key does not exist, its value is treated as if it were an empty list. If key exists but is not a list, an error is raised.

The parameters start and stop are zero-based indexes. Negative numbers are converted to indexes by adding the length of the list, so that -1 is the last element of the list, -2 the second-to-last element of the list, and so on.

Indexes outside the range of the list are not an error: If both start and stop are less than 0 or greater than or equal to the length of the list, an empty list is returned. Otherwise, if start is less than 0, it is treated as if it were 0, and if stop is greater than or equal to the length of the list, it is treated as if it were equal to the length of the list minus 1. If start is greater than stop, an empty list is returned.

Note that this method is subtly different from Python’s list slicing syntax, where list[start:stop] returns elements in the range [start, stop).

redis().lrange("prod_eventlog_queue", 0, 9)   # Returns *ten* elements!
redis().lrange("prod_eventlog_queue", 0, -1)  # Returns the entire list.
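
To make the index rules above concrete, here is a small pure-Python emulation of the inclusive [start, stop] range on an ordinary list. It is a sketch for illustration only and does not talk to Redis; the function name lrange_like is hypothetical.

```python
def lrange_like(items, start, stop):
    """Emulate the inclusive [start, stop] range described above on a Python list."""
    length = len(items)
    # Negative indexes count from the end of the list.
    if start < 0:
        start += length
    if stop < 0:
        stop += length
    # Clamp indexes that are still out of range.
    if start < 0:
        start = 0
    if stop >= length:
        stop = length - 1
    if start > stop or start >= length:
        return []
    # Python slices exclude the upper bound, hence stop + 1.
    return items[start:stop + 1]


queue = list(range(20))
first_ten = lrange_like(queue, 0, 9)
everything = lrange_like(queue, 0, -1)
```
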
get(key)

Returns the string stored at key. If key does not exist, returns None. If key exists but is not a string, an error is raised.

redis().get("example_redis_key")
keys(pattern)

Returns list of keys from Redis matching pattern.

redis().keys("*downtime*")
hget(key, field)

Returns the value of the field field of the hash key. If key does not exist or does not have a field named field, returns None. If key exists but is not a hash, an error is raised.

redis().hget("example_hash_key", "example_field_name")
hgetall(key)

Returns a dict of all fields of the hash key. If key does not exist, returns an empty dict. If key exists but is not a hash, an error is raised.

redis().hgetall("example_hash_key")
scan(cursor[, match=None][, count=None])

Returns the next cursor together with the results from this scan. Please see the Redis documentation on how to use this function exactly: http://redis.io/commands/scan

redis().scan(0, 'prefix*', 10)
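
A full iteration calls scan repeatedly until the returned cursor is 0 again, as described in the Redis documentation. The sketch below assumes that scan returns a (next_cursor, keys) pair, which is how the redis-py client behaves; whether the ZMON wrapper returns exactly this shape is an assumption. The helper scan_all and the stand-in fake_scan are hypothetical and only exist so the example runs outside ZMON.

```python
def scan_all(scan, match=None, count=None):
    """Collect all keys by calling scan(cursor, match, count) until the cursor is 0."""
    keys = set()  # SCAN may return a key more than once, so collect into a set.
    cursor = 0
    while True:
        cursor, batch = scan(cursor, match, count)
        keys.update(batch)
        if int(cursor) == 0:
            return keys


# Stand-in for redis().scan with made-up pages: cursor -> (next_cursor, keys).
_PAGES = {
    0: (7, ["prefix:a", "prefix:b"]),
    7: (3, ["prefix:c"]),
    3: (0, ["prefix:a"]),
}


def fake_scan(cursor, match=None, count=None):
    return _PAGES[cursor]


found = scan_all(fake_scan, "prefix*", 10)
```
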
smembers(key)

Returns members of set key in Redis.

redis().smembers("zmon:alert:1")
ttl(key)

Returns the time to live of an expiring key.

redis().ttl('lock')
statistics()

Returns a dict with general Redis statistics such as memory usage and operations/s. All values are extracted using the Redis INFO command.

Example result:

{
    "blocked_clients": 2,
    "commands_processed_per_sec": 15946.48,
    "connected_clients": 162,
    "connected_slaves": 0,
    "connections_received_per_sec": 0.5,
    "dbsize": 27351,
    "evicted_keys_per_sec": 0.0,
    "expired_keys_per_sec": 0.0,
    "instantaneous_ops_per_sec": 29626,
    "keyspace_hits_per_sec": 1195.43,
    "keyspace_misses_per_sec": 1237.99,
    "used_memory": 50781216,
    "used_memory_rss": 63475712
}

Please note that the values for both used_memory and used_memory_rss are in Bytes.
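
Since both memory values are in bytes, a check may want to convert them and compare them. The following sketch assumes a dict shaped like the example result above; the helper name memory_summary is hypothetical. The ratio of used_memory_rss to used_memory is the same quantity that Redis itself reports as mem_fragmentation_ratio.

```python
def memory_summary(stats):
    """Convert the byte values to MiB and relate resident memory to used memory."""
    used = stats["used_memory"]
    rss = stats["used_memory_rss"]
    mib = 1024.0 * 1024.0
    return {
        "used_mib": used / mib,
        "rss_mib": rss / mib,
        "rss_to_used": rss / float(used) if used else None,
    }


# Values copied from the example result above.
summary = memory_summary({"used_memory": 50781216, "used_memory_rss": 63475712})
```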

S3

Allows data to be pulled from S3 Objects.

s3()

Methods of S3

get_object_metadata(bucket_name, key)

Get the metadata associated with the given bucket_name and key. The metadata allows you to check for the existence of the key within the bucket and to check how large the object is without reading the whole object into memory.

Parameters:
  • bucket_name – the name of the S3 Bucket
  • key – the key that identifies the S3 Object within the S3 Bucket
Returns:

an S3ObjectMetadata object

class S3ObjectMetadata
exists()

Will return True if the object exists.

size()

Returns the size in bytes for the object. Will return -1 for objects that do not exist.

Example usage:

s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').exists()


s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').size()
get_object(bucket_name, key)

Get the S3 Object associated with the given bucket_name and key. This method will cause the object to be read into memory.

Parameters:
  • bucket_name – the name of the S3 Bucket
  • key – the key that identifies the S3 Object within the S3 Bucket
Returns:

an S3Object object

class S3Object
text()

Get the S3 Object data.

json()

If the object exists, parse the object as JSON.

Returns: a dict containing the parsed JSON or None if the object does not exist.
exists()

Will return True if the object exists.

size()

Returns the size in bytes for the object. Will return -1 for objects that do not exist.

Example usage:

s3().get_object('my bucket', 'mykeypart1/my_text_doc.txt').text()

s3().get_object('my bucket', 'mykeypart1/my_json_doc.json').json()
list_bucket(bucket_name, prefix, max_items)

List the S3 Objects associated with the given bucket_name that match prefix. A single listing request returns at most 1000 keys, so pagination is used internally to overcome this limit.

Parameters:
  • bucket_name – the name of the S3 Bucket
  • prefix – the prefix to search under
  • max_items – the maximum number of objects to list. Defaults to 100.
Returns:

an S3FileList object

class S3FileList
files()

Returns a list of dicts like

{
    "file_name": "foo",
    "size": 12345,
    "last_modified": datetime.datetime(2017, 7, 16, 1, 1, 21, tzinfo=tzutc())
}

Example usage:

s3().list_bucket('my bucket', 'some_prefix').files()

files = s3().list_bucket('my bucket', 'some_prefix', 10000).files()  # for listing a lot of keys
last_modified = files[0]["last_modified"].isoformat()  # returns a string that can be passed to time()
age = time() - time(last_modified)
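
The list returned by files() can be processed with ordinary Python. As a minimal sketch, the helper below (hypothetical name summarize_files) picks the most recently modified file and sums the sizes. The sample data is made up and uses naive datetimes for brevity, whereas the real result carries a UTC tzinfo.

```python
import datetime


def summarize_files(files):
    """Return (newest_file_name, total_size_in_bytes) for a files() style list."""
    if not files:
        return None, 0
    newest = max(files, key=lambda f: f["last_modified"])
    return newest["file_name"], sum(f["size"] for f in files)


# Made-up listing in the shape documented above.
files = [
    {"file_name": "foo", "size": 12345,
     "last_modified": datetime.datetime(2017, 7, 16, 1, 1, 21)},
    {"file_name": "bar", "size": 55,
     "last_modified": datetime.datetime(2017, 7, 17, 8, 0, 0)},
]
newest_name, total_size = summarize_files(files)
```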

Scalyr

The scalyr() wrapper enables querying Scalyr from your AWS worker if the credentials have been specified for the worker instance(s).

count(query, minutes=5)

Runs a count query against Scalyr. Depending on the number of queries, you may run into the rate limit.

scalyr().count(' ERROR ')
timeseries(query, minutes=30)

Runs a timeseries query against Scalyr with more generous rate limits. New time series are created on the fly by Scalyr.

facets(filter, field, max_count=5, minutes=30, prio='low')

This method is used to retrieve the most common values for a field.

By default, the Scalyr wrapper uses https://www.scalyr.com/ as its region. To use the Europe environment https://eu.scalyr.com/ instead, override the region with scalyr(scalyr_region='eu').

scalyr(scalyr_region='eu').count(' ERROR ')

SNMP

Provides a wrapper for the SNMP functions listed below. SNMP checks require specifying hosts in the entities filter. The partial object snmp() accepts a timeout=seconds parameter; the default timeout is 5 seconds. NOTE: This timeout is per answer, so multiple answers will add up and may block the whole check.

memory()
snmp().memory()

Returns host’s memory usage statistics. All values are in KiB (1024 Bytes).

Example check result as JSON:

{
    "ram_buffer": 359404,
    "ram_cache": 6478944,
    "ram_free": 20963524,
    "ram_shared": 0,
    "ram_total": 37066332,
    "ram_total_free": 22963392,
    "swap_free": 1999868,
    "swap_min": 16000,
    "swap_total": 1999868
}
load()
snmp().load()

Returns host’s CPU load average (1 minute, 5 minute and 15 minute averages).

Example check result as JSON:

{"load1": 0.95, "load5": 0.69, "load15": 0.72}
cpu()
snmp().cpu()

Returns host’s CPU usage in percent.

Example check result as JSON:

{"cpu_system": 0, "cpu_user": 17, "cpu_idle": 81}
df()
snmp().df()

Example check result as JSON:

{
    "/data/postgres-wal-nfs-example": {
        "available_space": 524287840,
        "device": "example0-2-stp-123:/vol/example_pgwal",
        "percentage_inodes_used": 0,
        "percentage_space_used": 0,
        "total_size": 524288000,
        "used_space": 160
    }
}
logmatch()
snmp().logmatch()
interfaces()
snmp().interfaces()

Example check result as JSON:

{
    "lo": {
        "in_octets": 63481918397415,
        "in_discards": 11,
        "adStatus": 1,
        "out_octets": 63481918397415,
        "opStatus": 1,
        "out_discards": 0,
        "speed": "10",
        "in_error": 0,
        "out_error": 0
    },
    "eth1": {
        "in_octets": 55238870608924,
        "in_discards": 8344,
        "adStatus": 1,
        "out_octets": 6801703429894,
        "opStatus": 1,
        "out_discards": 0,
        "speed": "10000",
        "in_error": 0,
        "out_error": 0
    },
    "eth0": {
        "in_octets": 3538944286327,
        "in_discards": 1130,
        "adStatus": 1,
        "out_octets": 16706789573119,
        "opStatus": 1,
        "out_discards": 0,
        "speed": "10000",
        "in_error": 0,
        "out_error": 0
    }
}
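
A result of this shape can be filtered per interface. The sketch below flags interfaces that report errors or more than a given number of discards. The helper name noisy_interfaces and the threshold of 1000 are hypothetical, and the counters are simply treated as plain numbers taken from the example above.

```python
def noisy_interfaces(interfaces, max_discards=1000):
    """Return sorted names of interfaces with errors or more than max_discards discards."""
    names = []
    for name, counters in interfaces.items():
        errors = counters["in_error"] + counters["out_error"]
        discards = counters["in_discards"] + counters["out_discards"]
        if errors > 0 or discards > max_discards:
            names.append(name)
    return sorted(names)


# Subset of the counters from the example result above.
sample = {
    "lo": {"in_discards": 11, "out_discards": 0, "in_error": 0, "out_error": 0},
    "eth1": {"in_discards": 8344, "out_discards": 0, "in_error": 0, "out_error": 0},
    "eth0": {"in_discards": 1130, "out_discards": 0, "in_error": 0, "out_error": 0},
}
flagged = noisy_interfaces(sample)
```
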
get()
snmp().get('iso.3.6.1.4.1.42253.1.2.3.1.4.7.47.98.105.110.47.115.104', 'stunnel', int)

Example check result as JSON:

{
    "stunnel": 0
}

SQL

sql([shard])

Provides a wrapper for connections to a PostgreSQL database and allows executing queries. All queries are executed in read-only transactions. The connection wrapper requires one parameter: a list of shard connections. The shard connections must come from the entity definition (see database-entities). Example query for a log database which returns a primitive long value:

sql().execute("SELECT count(*) FROM zl_data.log WHERE log_created > now() - '1 hour'::interval").result()

Example query which will return a single dict with keys a and b:

sql().execute('SELECT 1 AS a, 2 AS b').result()

The SQL wrapper will automatically sum up values over all shards:

sql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)

It’s also possible to query a single shard by providing its name:

sql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard

It’s also possible to query another database on the same server by overriding the shards information:

sql(shards={'customer_db' : entity['host'] + ':' + str(entity['port']) + '/another_db'}).execute('SELECT COUNT(1) AS c FROM my_table').results()

To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:

[
    {
        "type":        "database",
        "name":        "customer",
        "environment": "live",
        "role":        "master"
    }
]

The check command will have the form

>>> sql().execute('SELECT 1 AS a').result()
8
# Returns a single value: the sum over the result from all shards

>>> sql().execute('SELECT 1 AS a').results()
[{'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}]
# Returns a list of the results from all shards

>>> sql(shard='customer1').execute('SELECT 1 AS a').results()
[{'a': 1}]
# Returns the result from the specified shard in a list of length one

>>> sql().execute('SELECT 1 AS a, 2 AS b').result()
{'a': 8, 'b': 16}
# Returns a dict of the two values, which are each the sum over the result from all shards
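
The aggregation shown in these examples can be pictured as a column-wise sum over the per-shard rows. The following sketch only mirrors the documented outcomes for numeric columns; it is not the wrapper's actual implementation, and the function name combine_shard_results is hypothetical.

```python
def combine_shard_results(rows):
    """Sum per-shard rows column by column, mirroring the result() examples above."""
    totals = {}
    for row in rows:
        for column, value in row.items():
            totals[column] = totals.get(column, 0) + value
    # A single column collapses to a plain value, as in the first example.
    if len(totals) == 1:
        return list(totals.values())[0]
    return totals


# Eight shards, each returning the same row (made-up data matching the examples).
single_column = combine_shard_results([{"a": 1} for _ in range(8)])
two_columns = combine_shard_results([{"a": 1, "b": 2} for _ in range(8)])
```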

The results() function has several additional parameters:

sql().execute('SELECT 1 AS ONE, 2 AS TWO FROM dual').results([max_results=100], [raise_if_limit_exceeded=True])
max_results
The maximum number of rows you expect to get from the call. If not specified, defaults to 100. You cannot have an unlimited number of rows: there is an absolute maximum of 1,000,000 results that cannot be overridden. Note: If you need to process larger datasets, consider revisiting the architecture of your monitoring subsystem and moving the calculation logic into an external web service callable by ZMON 2.0.
raise_if_limit_exceeded
Raises an exception if the limit of rows would have been exceeded by the issued query.
orasql()

Provides a wrapper for connections to an Oracle database and allows executing queries. All queries are executed in read-only transactions. The connection wrapper requires three parameters: host, port and sid, which must come from the entity definition (see database-entities). One idiosyncratic behaviour to be aware of: when your query produces more than one value, you get a dict whose keys are the column names or aliases used in your query, and those keys are always returned in uppercase. To avoid confusion due to case changes, we recommend writing all aliases and column names in uppercase. Example of the simplest query, which returns a single value:

orasql().execute("SELECT 'OK' from dual").result()

Example query which will return a single dict with keys ONE and TWO:

orasql().execute('SELECT 1 AS ONE, 2 AS TWO from dual').result()

To execute a SQL statement on a LIVE server, tagged with the name business_intelligence, for example, use the following entity filter:

[
    {
        "type":        "oracledb",
        "name":        "business_intelligence",
        "environment": "live",
        "role":        "master"
    }
]
exacrm()

Provides a wrapper for connecting to the CRM Exasol database and executing queries. The connection wrapper requires one parameter: the query.

Example query:

exacrm().execute("SELECT 'OK';").result()

To execute a SQL statement on the itr-crmexa* servers use the following entity filter:

[
    {
        "type": "host",
        "host_role_id": "117"
    }
]
mysql([shard])

Provides a wrapper for connections to a MySQL database and allows executing queries. The connection wrapper requires one parameter: a list of shard connections. The shard connections must come from the entity definition (see database-entities). Example of the simplest query, which returns a single value:

mysql().execute("SELECT count(*) FROM mysql.user").result()

Example query which will return a single dict with keys h and u:

mysql().execute('SELECT host AS h, user AS u FROM mysql.user').result()

The SQL wrapper will automatically sum up values over all shards:

mysql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)

It’s also possible to query a single shard by providing its name:

mysql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard

To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:

[
    {
        "type":        "mysqldb",
        "name":        "lounge",
        "environment": "live",
        "role":        "master"
    }
]

TCP

This function opens a TCP connection to a host on a given port. If the connection succeeds, it returns ‘OK’. The host can be provided directly for global checks or resolved from entities filter. Assuming that we have an entity filter type=host, the example below will try to connect to every host on port 22:

tcp().open(22)
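
Conceptually, such a check amounts to opening and closing a socket. The sketch below uses Python's standard socket module and is not ZMON's implementation. The function name tcp_open, the default timeout and the error string returned on failure are assumptions, since the text above only documents the ‘OK’ case.

```python
import socket


def tcp_open(host, port, timeout=10):
    """Return 'OK' if a TCP connection can be established, else a short error string."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout) as error:
        # Assumption: the failure value is not documented above.
        return "ERROR: {}".format(error)
    connection.close()
    return "OK"
```

For example, tcp_open('192.0.2.10', 22) would try to reach an SSH port on a hypothetical host.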

Zomcat

Retrieve zomcat instance status (memory, CPU, threads).

zomcat().health()

This would return a dict like:

{
    "cpu_percentage": 5.44,
    "gc_percentage": 0.11,
    "gcs_per_sec": 0.25,
    "heap_memory_percentage": 6.52,
    "heartbeat_enabled": true,
    "http_errors_per_sec": 0.0,
    "jobs_enabled": true,
    "nonheap_memory_percentage": 20.01,
    "requests_per_sec": 1.09,
    "threads": 128,
    "time_per_request": 42.58
}

Most of the values are retrieved via JMX:

cpu_percentage
CPU usage in percent (retrieved from JMX).
gc_percentage
Percentage of time spent in garbage collection runs.
gcs_per_sec
Garbage collections per second.
heap_memory_percentage
Percentage of heap memory used.
nonheap_memory_percentage
Percentage of non-heap memory (e.g. permanent generation) used.
heartbeat_enabled
Boolean indicating whether heartbeat.jsp is enabled (true) or not (false). If /heartbeat.jsp cannot be retrieved, the value is null.
http_errors_per_sec
Number of Tomcat HTTP errors per second (all 4xx and 5xx HTTP status codes).
jobs_enabled
Boolean indicating whether jobs are enabled (true) or not (false). If /jobs.monitor cannot be retrieved, the value is null.
requests_per_sec
Number of HTTP/AJP requests per second.
threads
Total number of threads.
time_per_request
Average time in milliseconds per HTTP/AJP request.
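
These fields lend themselves to simple threshold checks. As a sketch, the helper below (hypothetical name zomcat_problems, with made-up thresholds) lists which parts of a health dict look unhealthy. It treats a null (None) heartbeat_enabled or jobs_enabled value as a problem, which is a design choice of this example rather than documented behaviour.

```python
def zomcat_problems(health, max_heap_percentage=90.0, max_http_errors_per_sec=1.0):
    """Return a list of field names that violate the given (hypothetical) thresholds."""
    problems = []
    if health["heap_memory_percentage"] > max_heap_percentage:
        problems.append("heap")
    if health["http_errors_per_sec"] > max_http_errors_per_sec:
        problems.append("http_errors")
    # None means the corresponding page could not be retrieved.
    for flag in ("heartbeat_enabled", "jobs_enabled"):
        if health[flag] is not True:
            problems.append(flag)
    return problems


# Values taken from the example result above.
healthy = {
    "heap_memory_percentage": 6.52,
    "http_errors_per_sec": 0.0,
    "heartbeat_enabled": True,
    "jobs_enabled": True,
}
degraded = dict(healthy, heap_memory_percentage=95.0, heartbeat_enabled=None)
```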

Helper Functions

The following general-purpose functions are available in check commands:

abs(number)

Returns the absolute value of the argument. Does not have overflow issues.

>>> abs(-1)
1
>>> abs(1)
1
>>> abs(-2147483648)
2147483648
all(iterable)

Returns True if none of the elements of iterable are falsy.

>>> all([4, 2, 8, 0, 3])
False

>>> all([])
True
any(iterable)

Returns True if at least one element of iterable is truthy.

>>> any([None, [], '', {}, 0, 0.0, False])
False

>>> any([])
False
avg(results)

Returns the arithmetic mean of the values in results. Returns None if there are no values. results must not be an iterator.

>>> avg([1, 2, 3])
2.0

>>> avg([])
None
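
The documented behaviour can be reproduced with a few lines of plain Python. This is a sketch under the assumption that the mean is computed from sum() and len(), which would also explain why an iterator is not accepted; the name avg_sketch is hypothetical and this is not ZMON's own code.

```python
def avg_sketch(results):
    """Arithmetic mean of a list or tuple; None when it is empty."""
    if not results:
        return None
    # len() needs a sized collection, so a bare iterator would not work here.
    return sum(results) / float(len(results))


mean = avg_sketch([1, 2, 3])
```
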
basestring()

Superclass of str and unicode, useful for checking whether a value is a string of some sort.

>>> isinstance('foo', basestring)
True
>>> isinstance(u'ˈ', basestring)
True
bin(n)

Returns a string of the given integer in binary representation.

>>> bin(1000)
'0b1111101000'
bool(x)

Returns True if x is truthy, and False otherwise. Does not parse strings. Also usable to check whether a value is Boolean.

>>> bool(None)
False

>>> bool('False')
True

>>> isinstance(False, bool)
True
chain(*iterables)

Returns an iterator that yields the elements of the first iterable, followed by the elements of the second iterable, and so on.

>>> list(chain([1, 2, 3], 'abc'))
[1, 2, 3, 'a', 'b', 'c']

>>> list(chain())
[]
chr(n)

Returns the character for the given ASCII code.

>>> chr(65)
'A'
class Counter([iterable-or-mapping])

Creates a specialized dict for counting things. See the official Python documentation for details.

dict([mapping][, **kwargs])

Creates a new dict. Usually, using a literal will be simpler, but the function may be useful to copy dicts, to convert a list of key/value pairs to a dict, or to check whether some object is a dict.

>>> dict(a=1, b=2, c=3)
{'a': 1, 'c': 3, 'b': 2}

>>> dict({'a': 1, 'b': 2, 'c': 3})
{'a': 1, 'c': 3, 'b': 2}   # This is a copy of the original dict.

>>> dict([['a', 1], ['b', 2], ['c', 3]])
{'a': 1, 'c': 3, 'b': 2}

>>> isinstance({}, dict)
True
divmod(x, y)

Performs integer division and modulo as a single operation.

>>> divmod(23, 5)
(4, 3)
empty(v)

Indicates whether v is falsy. Equivalent to not v.

>>> empty([])
True

>>> empty([0])
False
enumerate(iterable[, start=0])

Generates tuples (start + 0, iterable[0]), (start + 1, iterable[1]), .... Useful to have access to the index in a loop.

>>> list(enumerate(['a', 'b', 'c'], start=1))
[(1, 'a'), (2, 'b'), (3, 'c')]
filter(function, iterable)

Returns a list of all objects in iterable for which function returns a truthy value. If function is None, the returned list contains all truthy objects in iterable.

>>> filter(lambda n: n % 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
[1, 2, 4, 5, 7, 8, 10]

>>> filter(None, [False, None, 0, 0.0, '', [], {}, 1000])
[1000]
float(x)

Returns x as a floating-point number. Parses strings.

>>> float('2.5')
2.5

>>> float('-inf')
-inf

>>> float(2)
2.0

This is useful to force proper division:

>>> 2 / 5
0

>>> float(2) / 5
0.4

Also usable to check whether a value is a floating-point number:

>>> isinstance(2.5, float)
True

>>> isinstance(2, float)
False
groupby(iterable[, key])

A somewhat obscure function for grouping consecutive equal elements in an iterable. See the official Python documentation for more details.

>>> [(k, list(v)) for k, v in groupby('abba')]
[('a', ['a']), ('b', ['b', 'b']), ('a', ['a'])]
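Because only consecutive equal elements are merged, input usually has to be sorted by the same key before grouping. The sketch below contrasts both cases; the sample data and the status_of helper are hypothetical, and the import is only needed outside a check command.

```python
from itertools import groupby

# Hypothetical (host, status) samples, deliberately not sorted by status.
samples = [('web01', 'ok'), ('db01', 'failed'), ('web02', 'ok'), ('db02', 'failed')]

def status_of(sample):
    """Key function: group by the status field."""
    return sample[1]

# Without sorting, every run of equal keys becomes its own group.
unsorted_groups = [(k, [host for host, _ in g])
                   for k, g in groupby(samples, key=status_of)]

# Sorting by the same key first yields one group per distinct status.
by_status = sorted(samples, key=status_of)
sorted_groups = [(k, [host for host, _ in g])
                 for k, g in groupby(by_status, key=status_of)]
```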
hex(n)

Returns a string of the given integer in hexadecimal representation.

>>> hex(1000)
'0x3e8'
int(x[, base])

Returns x as an integer. Truncates floating-point numbers and parses strings. Also usable to check whether a value is an integer.

>>> int(2.5)
2

>>> int(-2.5)
-2

>>> int('2')
2

>>> int('abba', 16)
43962

>>> isinstance(2, int)
True
isinstance(object, classinfo)

Indicates whether object is an instance of the given class or classes.

>>> isinstance(2, int)
True

>>> isinstance(2, (int, float))
True

>>> isinstance('2', int)
False
json(s)

Converts the given JSON string to a Python object.

>>> json('{"list": [1, 2, 3, 4]}')
{u'list': [1, 2, 3, 4]}
jsonpath_flat_filter(obj, path)

Executes a JSONPath expression using jsonpath_rw and returns a flat dict that maps each full path to its value.

>>> data = {"timers":{"/api/v1/":{"m1.rate": 12, "99th": "3ms"}}}
>>> jsonpath_flat_filter(data, "timers.*.*.'m1.rate'")
{"timers./api/v1/.m1.rate": 12}
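To make the shape of the flat result concrete, here is a plain-Python sketch that builds dotted full paths for a nested dict. It does not use jsonpath_rw and does not evaluate path expressions; flatten is a hypothetical helper that only mimics the (full_path, value) layout shown above.

```python
def flatten(obj, prefix=''):
    """Return {dotted_path: leaf_value} for a nested dict (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        path = key if not prefix else prefix + '.' + key
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

data = {"timers": {"/api/v1/": {"m1.rate": 12, "99th": "3ms"}}}
flat = flatten(data)

# Keep only the leaves whose last path component is 'm1.rate'.
rates = dict((path, value) for path, value in flat.items()
             if path.endswith('.m1.rate'))
```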
jsonpath_parse(path)

Creates a JSONPath parser object using jsonpath_rw, to be used in your check command.

len(s)

Returns the length of the given collection.

>>> len('foo')
3

>>> len([0, 1, 2])
3

>>> len({'a': 1, 'b': 2, 'c': 3})
3
list(iterable)

Creates a new list. Usually, using a literal will be simpler, but the function may be useful to copy lists, to convert some other iterable to a list, or to check whether some object is a list.

>>> list({'a': 1, 'b': 2, 'c': 3})
['a', 'c', 'b']

>>> list(chain([1, 2, 3], 'abc'))
[1, 2, 3, 'a', 'b', 'c']   # Without the list call, this would be a chain object.

>>> isinstance([1, 2, 3], list)
True
long(x[, base])

Converts a number or string to a long integer.

>>> long(2.5)
2L

>>> long(-2.5)
-2L

>>> long('2')
2L

>>> long('abba', 16)
43962L
map(function, iterable)

Calls function on each element of iterable and returns the results as a list.

>>> map(lambda n: n**2, [0, 1, 2, 3, 4, 5])
[0, 1, 4, 9, 16, 25]
max(iterable)

Returns the greatest element of iterable. With two or more arguments, returns the greatest argument instead.

>>> max([2, 4, 1, 3])
4

>>> max(2, 4, 1, 3)
4
min(iterable)

Returns the smallest element of iterable. With two or more arguments, returns the smallest argument instead.

>>> min([2, 4, 1, 3])
1

>>> min(2, 4, 1, 3)
1
normalvariate(mu, sigma)

Returns a normally distributed random variable with the given mean and standard deviation.

>>> normalvariate(0, 1)
-0.1711153439880709
oct(n)

Returns a string of the given integer in octal representation.

>>> oct(1000)
'01750'
ord(c)

Returns the ASCII code of the given character.

>>> ord('A')
65
pow(x, y[, z])

Computes x to the power of y. The result is modulo z, if z is given, and the function is much, much faster than (x ** y) % z in that case.

>>> pow(56876845793546543243783543735425734536873, 12425445412439354354394354397364398364378, 10)
9L
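With smaller numbers the equivalence of the two forms is easy to check by hand. The values below are arbitrary illustrations.

```python
base, exponent, modulus = 7, 128, 13

# Three-argument pow reduces modulo `modulus` at every step,
# so intermediate values never grow large.
fast = pow(base, exponent, modulus)

# The naive form first builds the full 109-digit power, then reduces it.
slow = (base ** exponent) % modulus
```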
range([start, ]stop[, step])

Returns a list of integers [start, start + step * 1, start + step * 2, ...] where all integers are less than stop, or greater than stop if step is negative.

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(1, 11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> range(1, 1)
[]
>>> range(11, 1)
[]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(10, -1, -1)
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
reduce(function, iterable[, initializer])

Calls function(r, e) for each element e in iterable, where r is the result of the last such call, or initializer for the first such call. If iterable has no elements, returns initializer.

If initializer is omitted, the first element of iterable is removed and used in place of initializer. In that case, an error is raised if iterable has no elements.

>>> reduce(lambda a, b: a * b, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1)
3628800  # 10!

Note: Because of a Python bug, reduce used to be unreliable. This issue should now be fixed.
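The effect of the initializer, including the empty-iterable case, can be sketched as follows. The functools import is only needed when running the snippet on Python 3, and multiply is a hypothetical helper.

```python
from functools import reduce  # a builtin on Python 2; imported here for portability

def multiply(a, b):
    return a * b

# With an initializer: ((((1 * 1) * 2) * 3) * 4)
with_init = reduce(multiply, [1, 2, 3, 4], 1)

# Without an initializer: the first element seeds the result, ((1 * 2) * 3) * 4
without_init = reduce(multiply, [1, 2, 3, 4])

# Empty iterable with an initializer: the initializer is returned unchanged.
empty_with_init = reduce(multiply, [], 1)

# Empty iterable without an initializer raises TypeError.
try:
    reduce(multiply, [])
    empty_error = None
except TypeError as exc:
    empty_error = type(exc).__name__
```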

reversed(iterable)

Returns an iterator that iterates over the elements in iterable in reverse order.

>>> list(reversed([1, 2, 3]))
[3, 2, 1]
round(n[, digits=0])

Rounds the given number to the given number of digits, rounding half away from zero.

>>> round(23.4)
23.0
>>> round(23.5)
24.0
>>> round(-23.4)
-23.0
>>> round(-23.5)
-24.0
>>> round(0.123456789, 3)
0.123
>>> round(987654321, -3)
987654000.0
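The tie-breaking rule matters for values that end in exactly .5. Note that the built-in round on Python 3 rounds ties to the nearest even number instead, so the outputs above reflect Python 2 behavior. The sketch below spells out "half away from zero" explicitly with the decimal module; round_half_away is a hypothetical helper, not part of the check command API.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_away(value, digits=0):
    """Round to `digits` decimal places, with ties going away from zero."""
    # Decimal(1).scaleb(-digits) is 10 ** -digits with the matching exponent,
    # which quantize() uses as the target precision.
    quantum = Decimal(1).scaleb(-digits)
    # decimal's ROUND_HALF_UP sends ties away from zero for both signs.
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

positive_tie = round_half_away(23.5)
negative_tie = round_half_away(-23.5)
even_tie = round_half_away(2.5)
three_places = round_half_away(0.123456789, 3)
thousands = round_half_away(987654321, -3)
```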
set(iterable)

Returns a set built from the elements of iterable. Useful to remove duplicates from some collection.

>>> set([1, 2, 1, 4, 3, 2, 2, 3, 4, 1])
set([1, 2, 3, 4])
sorted(iterable[, reverse])

Returns a sorted list containing the elements of iterable.

>>> sorted([2, 4, 1, 3])
[1, 2, 3, 4]

>>> sorted([2, 4, 1, 3], reverse=True)
[4, 3, 2, 1]
str(object)

Returns the string representation of object. Also usable to check whether a value is a string. If the result would contain Unicode characters, the unicode() function must be used instead.

>>> str(2)
'2'

>>> str({'a': 1, 'b': 2, 'c': 3})
"{'a': 1, 'c': 3, 'b': 2}"

>>> isinstance('foo', str)
True
sum(iterable)

Returns the sum of the elements of iterable, or 0 if iterable is empty.

>>> sum([1, 2, 3, 4])
10

>>> sum([])
0
time([spec][, utc])

Given a time specification such as '-10m' for “ten minutes ago” or '+3h' for “in three hours”, returns an object representing that timestamp. Valid units are s for seconds, m for minutes, h for hours, and d for days.

The time specification spec can also be a Unix epoch/timestamp or a valid ISO timestamp in one of the following formats: YYYY-MM-DD HH:MM:SS.mmmmm, YYYY-MM-DD HH:MM:SS, YYYY-MM-DD HH:MM or YYYY-MM-DD.

If spec is omitted, the current time is used. If utc is True the timestamp uses UTC, otherwise it uses local time.
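How a relative specification maps to a point in time can be sketched with the standard library. This is not the actual parser behind time(); relative_offset, UNIT_SECONDS and the fixed reference time are hypothetical names and values chosen for illustration, mirroring only the units documented above.

```python
import re
from datetime import datetime, timedelta

# Seconds per documented unit: s, m, h, d.
UNIT_SECONDS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}

def relative_offset(spec):
    """Turn a spec such as '-10m' or '+3h' into a timedelta (hypothetical helper)."""
    match = re.match(r'^([+-])(\d+)([smhd])$', spec)
    if match is None:
        raise ValueError('unsupported time spec: %r' % spec)
    sign, amount, unit = match.groups()
    seconds = int(amount) * UNIT_SECONDS[unit]
    return timedelta(seconds=-seconds if sign == '-' else seconds)

# A fixed reference time stands in for "now" so the results are reproducible.
reference = datetime(2014, 3, 25, 18, 5, 50)
ten_minutes_ago = reference + relative_offset('-10m')
in_three_hours = reference + relative_offset('+3h')
in_four_days = reference + relative_offset('+4d')
```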

The returned object has two methods:

isoformat([sep])

Returns the timestamp as a string of the form YYYY-MM-DD HH:MM:SS.mmmmmm. The default behavior is to omit the T between date and time. This can be overridden by passing the optional sep parameter to the method.

>>> time('+4d').isoformat()
'2014-03-29 18:05:50.098919'

>>> time(1396112750).isoformat()
'2014-03-29 18:05:50'

>>> time('+4d').isoformat('T')
'2014-03-29T18:05:50.098919'
format(fmt)

Returns the timestamp as a string formatted according to the given format. See the official Python documentation for an incomplete list of supported format directives.
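Assuming the format directives are the usual strftime ones (the reference to the Python documentation suggests this, but it is an assumption here), the effect can be sketched with a plain datetime object standing in for the returned timestamp.

```python
from datetime import datetime

# Stand-in for the object returned by time(); in a check command this would
# be roughly time(...).format('%Y-%m-%d') instead of strftime().
moment = datetime(2014, 3, 29, 18, 5, 50)

as_date = moment.strftime('%Y-%m-%d')      # year-month-day
as_clock = moment.strftime('%H:%M')        # 24-hour clock, hours and minutes
as_compact = moment.strftime('%Y%m%dT%H%M%S')
```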

Additionally, the subtraction operator is overloaded and returns the time difference in seconds:

>>> time('2014-01-01 01:13') - time('2014-01-01 01:01')
720.0
timestamp()

Returns the current Unix timestamp. This wraps time.time().

tuple(iterable)

Returns the given iterable as a tuple (an immutable list, basically). Also usable to check whether a value is a tuple.

>>> tuple([1, 2, 3])
(1, 2, 3)
>>> isinstance((1, 2, 3), tuple)
True
unicode(object)

Returns the string representation of object as a Unicode string. Also usable to check whether a value is a Unicode string.

>>> unicode({u'α': 1, u'β': 2, u'γ': 3})
u"{u'\\u03b1': 1, u'\\u03b3': 3, u'\\u03b2': 2}"

>>> isinstance(u'ˈ', unicode)
True
unichr(n)

Returns the Unicode character with the given code point. Might be limited to code points less than 0x10000.

>>> unichr(0x2a13)  # LINE INTEGRATION WITH SEMICIRCULAR PATH AROUND POLE
u'⨓'
xrange([start, ]stop[, step])

As range(), but returns an iterator rather than a list.
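The practical difference is that no list is built up front, which matters for long loops. The sketch below includes a small fallback so it also runs on Python 3, where range is already lazy and xrange no longer exists; inside a check command only the loop itself would be needed.

```python
try:
    xrange  # available as a builtin on Python 2
except NameError:
    xrange = range  # assumption for Python 3: range behaves like xrange

total = 0
# Values are produced one at a time: 0, 250000, 500000, 750000.
for n in xrange(0, 1000000, 250000):
    total += n
```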

zip(*iterables)

Returns a list of tuples where the i-th tuple contains the i-th element from each of the given iterables. If the iterables have different lengths, the result is truncated to the length of the shortest one.

>>> zip(['a', 'b', 'c'], [1, 2, 3])
[('a', 1), ('b', 2), ('c', 3)]
>>> zip(['A', 'B', 'C'], ['a', 'b', 'c'], [1, 2, 3])
[('A', 'a', 1), ('B', 'b', 2), ('C', 'c', 3)]
>>> zip([], [1, 2, 3])
[]
re()

Python's re module, for regular expression operations.

>>> re.match(r'^ab.*', 'a123b') is not None
False

>>> re.match(r'^ab.*', 'ab123') is not None
True
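Beyond yes/no matching, capture groups are handy for pulling values out of text. The sketch below assumes a made-up log line format; the import is only needed outside a check command.

```python
import re

# Hypothetical log line: method, path, status code, latency.
line = 'GET /api/v1/users 503 120ms'

# search() scans the whole string, unlike match(), which anchors at the start.
match = re.search(r' (\d{3}) (\d+)ms$', line)

if match:
    status = int(match.group(1))
    latency_ms = int(match.group(2))
else:
    status = latency_ms = None
```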
math()

Python's math module, for mathematical operations.

>>> math.log(4, 2)
2.0