Onboarding Term Glossary Flashcards
Alert
An alert is the report of a potential problem that an integrated monitoring system has sent to BigPanda.
Monitoring tools generate events when potential problems are detected in your infrastructure. Over time, status updates and repeat events may occur from the same system issue. In BigPanda, raw event data is merged into a single alert so that you can visualize the life cycle of a detected issue over time.
For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with a resolution event. All three of these events will be merged into a single alert. Common events that are sent to BigPanda include “CPU > 95% for more than 5 minutes” and “Port X on Router ABC down.”
BigPanda correlates related alerts into incidents for visibility into high-level, actionable problems.
NOTE: Some monitoring tools refer to ‘events’ as ‘alarms’ or ‘alerts.’ In BigPanda documentation, ‘alert’ is always used to refer to the complete lifecycle of an event.
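The merge described above can be sketched in a few lines of Python. This is only an illustration, not BigPanda's implementation: the `Alert` class, the `merge_events` function, the event field names, and the choice to key alerts on host and check are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical container for the merged life cycle of one detected issue."""
    key: tuple
    status: str
    history: list = field(default_factory=list)


def merge_events(events):
    """Fold raw events into one alert per (host, check) pair (assumed key)."""
    alerts = {}
    for event in events:
        key = (event["host"], event["check"])
        alert = alerts.get(key)
        if alert is None:
            alert = alerts[key] = Alert(key=key, status=event["status"])
        alert.status = event["status"]        # latest event sets the current status
        alert.history.append(event["status"])  # earlier statuses stay visible
    return alerts


# Warning, critical, and resolution events for the same CPU load issue.
events = [
    {"host": "web-1", "check": "cpu_load", "status": "warning"},
    {"host": "web-1", "check": "cpu_load", "status": "critical"},
    {"host": "web-1", "check": "cpu_load", "status": "ok"},
]
alerts = merge_events(events)
```

The three events collapse into a single alert whose history records the warning, critical, and resolved stages in order.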
Alert Correlation (Pt 1)
Alert correlation is a process of grouping related alerts into a single, high-level incident. BigPanda uses pattern recognition to automatically process the data generated by your monitoring systems and to dynamically cluster alerts into meaningful, actionable incidents. BigPanda provides default correlation patterns as well as the option to tailor patterns to your organization.
BigPanda ingests the raw event data from monitoring systems such as Nagios, CloudWatch, and systems integrated via the Alerts API. The data is normalized into standard tags and enriched with configuration information, operational categories and other custom tags. Then, the BigPanda alert correlation engine merges the events into alerts and clusters the alerts into high-level, actionable incidents by evaluating the properties against patterns in:
Topology - The host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same area in your infrastructure.
Time - The rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.
Context - The type of alerts. Some alert types imply a relationship between them, while others don’t.
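As a rough illustration of the topology and time dimensions, the sketch below groups alerts that share a host and arrive close together. It is a simplified stand-in, not BigPanda's correlation engine: the `correlate` function, the 300-second window, and the alert fields are assumptions, and the context dimension is left out entirely.

```python
def correlate(alerts, window_seconds=300):
    """Group alerts from the same host when each one arrives within
    window_seconds of the previous alert in that group (assumed rule)."""
    incidents = []
    latest_for_host = {}  # host -> most recent incident opened for that host
    for alert in sorted(alerts, key=lambda a: a["time"]):
        incident = latest_for_host.get(alert["host"])
        if incident and alert["time"] - incident[-1]["time"] <= window_seconds:
            incident.append(alert)            # same topology, close in time
        else:
            incident = [alert]                # otherwise start a new incident
            incidents.append(incident)
            latest_for_host[alert["host"]] = incident
    return incidents


alerts = [
    {"host": "db-1", "check": "disk_io", "time": 0},
    {"host": "db-1", "check": "cpu_load", "time": 120},
    {"host": "web-1", "check": "http", "time": 130},
    {"host": "db-1", "check": "disk_io", "time": 4000},
]
incidents = correlate(alerts)
```

Here the first two db-1 alerts land in one incident, the web-1 alert gets its own, and the late db-1 alert opens a new incident because it falls outside the time window.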
Alert Correlation (Pt 2)
As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new incident. With this powerful algorithm, BigPanda can effectively and accurately correlate alerts to dramatically reduce your monitoring noise by as much as 90–99% in some environments. Correlations occur in under 100 ms, so you see updates in real time for maximum visibility into critical problems.
You can customize correlation patterns to tailor alert correlation to the specifics of your infrastructure. Learn more about customizing alert correlation in the Managing Correlation Patterns documentation.
Understanding how BigPanda determines which events are correlated into an alert and which alerts are grouped together into incidents can help you configure and use BigPanda more effectively, particularly if you are using the Alerts REST API to develop a custom integration or the correlation editor to modify a correlation pattern. Learn more about the way BigPanda correlates alerts in the Alert Correlation Logic documentation.
Agile
Agile is a software development philosophy built around iterative development. There are many Agile methods, but most of them entail short engineering cycles that include all the main stages: planning, development, testing, and deployment. Each cycle typically takes one or two weeks. The idea behind Agile is to ship the product as quickly as possible and update it incrementally based on customer feedback. Agile methods remain mainstream in modern software development because they help products adapt to constantly changing market and customer needs.
API
Application Programming Interfaces (APIs) are software intermediaries that allow applications to talk to each other. BigPanda has several APIs available that allow you to integrate with external tools and manage incidents and BigPanda elements in bulk. They are core tools for self-service driven customers, and empower custom solutions and deep 2-way integrations.
BigPanda API specifications can be found in the API Reference hub.
With each request to the BigPanda API, you must include an HTTP header with the authentication token for your organization. BigPanda APIs use two different types of authentication tokens: an organization-wide bearer token or a user-specific API key.
The Alerts API lets you build a custom integration between BigPanda and your monitoring system. Monitoring systems generally send out events when problems are detected and when problems have been resolved.
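To make the authentication header concrete, here is a minimal sketch that builds (but does not send) an HTTP request carrying a bearer token. The URL is a placeholder, the token is a dummy value, and the payload field names are assumptions for illustration; consult the API Reference hub for the real endpoint and schema.

```python
import json
import urllib.request


def build_alert_request(url, token, event):
    """Build a POST request that sends one event with a bearer token header."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # organization-wide bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_alert_request(
    "https://bigpanda.example/alerts",  # placeholder URL, not the real endpoint
    "YOUR_ORG_TOKEN",                   # dummy token
    {
        "host": "web-1",
        "check": "cpu_load",
        "status": "critical",           # assumed field name and value
        "description": "CPU > 95% for more than 5 minutes",
    },
)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without network access or real credentials.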
Artificial Intelligence (AI)
Also known as machine intelligence, artificial intelligence (AI) is the ability of machine systems to mimic human cognitive functions such as learning and problem solving. The goal of artificial intelligence is to create machines or programs that can work, react, and respond to complex situations.
For most business initiatives, the focus of artificial intelligence is to design programs that can develop and progress in a specific task without using explicit instructions, allowing the program to rely on patterns and inference instead. Machine learning allows a machine or program to develop and create a solution on its own, once limitations and standards are set, rather than simply following programming.
BigPanda’s Open Box Machine Learning combines the power of AI with transparency and customization through explainable AI. With BigPanda Open Box Machine Learning, the logic is explained to IT Operations teams in plain English. Teams can then edit this logic to add situational and tribal knowledge to strengthen it on their own, without requiring expert data scientists. From there, teams can test and run what-if experiments on real live production data to make sure their changes work as intended, before deploying them, promoting higher trust and adoption of machine learning throughout the organization.
The BigPanda Machine Learning Engine runs during alert correlation to suggest patterns that may improve correlation and during root cause analysis to highlight potential root causes of incidents.
BigPanda Dashboards
BigPanda Dashboards provide easy-to-read operational health metrics in a consolidated view. Ideal for NOC displays and status monitoring, each Dashboard is made up of a series of widgets showing color-coded key information on incident severity and status.
Each widget shows information for a single environment, making it easy to track incident metrics by region, team, or infrastructure type. For example, you might have environments for each business service so that you can track metrics on each separately.
Description
Each monitoring tool is configured to send specific data in the description field of the event payload. This description data will be included with alerts and appear in incident details. For many integrations, the default description can be configured to include additional information. See the specific Integration instructions on the documentation site or in BigPanda for information about configuring the description field.
NOTE: Description is a reserved system word within BigPanda and cannot be changed or redefined for use in custom enrichment. When sending description fields to BigPanda, ensure that description is lowercase only.
Environment
Environments group related incidents together for improved automation and visibility.
Environments filter incidents on properties such as source and priority and group them together for easy visibility and action. Environments make it easy for your team to focus on the incidents relevant to their role and responsibilities.
BigPanda’s default environment is the All Incidents Environment. This environment includes every incident in BigPanda with no filters or limitations. Environments can be used to filter the Incident Feed, define AutoShare rules, create Dashboards, and view specific Analytics. Learn more about how environments enable BigPanda’s automation and advanced tools in the Environments documentation.
Your BigPanda environment groups can be customized to better fit the structure and processes of your organization. Create, edit, or delete environment groups to help your teams stay focused on the information most relevant to them. Environment groups are managed from the Environments pane. Learn more in the Managing Environments documentation.
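Conceptually, an environment behaves like a saved filter over incidents. The sketch below models that idea with plain Python predicates. The `make_environment` helper and the incident fields are hypothetical and do not reflect BigPanda's actual filter syntax.

```python
def make_environment(**criteria):
    """Return a predicate that accepts incidents whose properties match
    every criterion; with no criteria, every incident matches."""
    def matches(incident):
        return all(incident.get(name) in allowed
                   for name, allowed in criteria.items())
    return matches


# An unfiltered environment, analogous to the All Incidents Environment.
all_incidents = make_environment()
# A hypothetical environment filtering on source and priority.
nagios_critical = make_environment(source={"nagios"}, priority={"critical"})

incidents = [
    {"id": 1, "source": "nagios", "priority": "critical"},
    {"id": 2, "source": "cloudwatch", "priority": "critical"},
    {"id": 3, "source": "nagios", "priority": "warning"},
]
visible = [i["id"] for i in incidents if nagios_critical(i)]
```

Only the first incident satisfies both criteria, while `all_incidents` accepts all three.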
Event
Monitoring tools generate events when potential problems are detected in your infrastructure. Over time, status updates and repeat events may occur from the same system issue. In BigPanda, raw event data is merged into a single alert so that you can visualize the life cycle of a detected issue over time.
For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with a resolution event. All three of these events will be merged into a single alert. Common events that are sent to BigPanda include “CPU > 95% for more than 5 minutes” and “Port X on Router ABC down.”
BigPanda correlates related alerts into incidents for visibility into high-level, actionable problems.
NOTE: Some monitoring tools refer to events as ‘alarms’ or ‘alerts.’ In BigPanda documentation ‘alert’ is always used to refer to the complete lifecycle of an event.
Flapping
Flapping occurs when a monitored object (e.g., a service or host) changes state too frequently, making the cause and severity of the incident unclear. Flapping can indicate configuration problems (e.g., thresholds set too low), troublesome services, or real network problems.
When an alert changes states frequently, it may generate numerous events that are not immediately actionable.
In BigPanda, an incident enters the flapping state when one or more of the related alerts are flapping. By default, an alert is considered to be flapping when it has changed states more than 4 times in one hour. Contact BigPanda support if you need to configure custom logic (number of state changes within a period of time) for your organization or for a specific integration.
When an incident enters the flapping state, all subscribed users are notified and no additional state change notifications are sent. Subscribed users still receive a daily email reminding them about the incident. An incident exits the flapping state when all related alerts stop flapping (no longer meet the criteria for number of state changes in a period of time). BigPanda checks the flapping criteria every 15 minutes.
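The default rule (more than 4 state changes in one hour) can be expressed as a small check. This is a sketch of the stated criteria only; the function names and the way state-change timestamps are stored are assumptions.

```python
from datetime import datetime, timedelta


def is_flapping(change_times, now, max_changes=4, window=timedelta(hours=1)):
    """Return True when an alert changed state more than max_changes
    times within the window ending at now."""
    recent = [t for t in change_times if now - window <= t <= now]
    return len(recent) > max_changes


def incident_is_flapping(alert_change_times, now):
    """An incident is flapping when one or more of its alerts are flapping."""
    return any(is_flapping(times, now) for times in alert_change_times)


now = datetime(2024, 1, 1, 12, 0)
# Five state changes in the last hour: above the default threshold.
noisy = [now - timedelta(minutes=m) for m in (5, 15, 25, 35, 45)]
# Four state changes in the last hour: at the threshold, so not flapping.
steady = [now - timedelta(minutes=m) for m in (5, 15, 25, 35)]
```

With these sample timestamps, `is_flapping(noisy, now)` is true and `is_flapping(steady, now)` is false, and an incident containing both alerts counts as flapping.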
Incident
An incident is essentially an unplanned interruption to an IT service or reduction in the quality of an IT service. It represents a high-level issue in your infrastructure. In BigPanda, incidents are created automatically by grouping together related alerts from your monitoring tools.
A single production issue often manifests itself in multiple alerts. For example, a disk issue can trigger a disk IO alert that, in turn, triggers a series of CPU, memory, database, and application alerts. Additionally, each alert may change as an issue progresses. An alert may start as a warning, and then increase in severity to a critical status. In these cases, diagnosing and fixing the issue requires up-to-date information from multiple sources, which is very difficult to gather and maintain manually.
BigPanda digests all of the raw data from your integrated monitoring systems and automatically correlates this complex data into single issue incidents, which gives you the visibility you need to investigate and resolve issues quickly.
All active and recently resolved incidents appear on the Incidents tab, where you can manage incidents through the operations workflow with BigPanda as your unified console. You can also escalate incidents through external ticketing and/or collaboration systems—manually as needed, or automatically as a smart ticketing solution—and BigPanda will keep the external systems up to date with the latest information.
The life cycle of an incident is defined by the life cycle of the alerts it contains. The incident feed provides a consolidated view of all active incidents from any integrated monitoring systems. After you’ve configured your integrations, you can use the incident feed to manage your incidents.
The Incidents API allows you to manage BigPanda incidents externally, and can be configured with external ticketing and monitoring tools. It provides the Incidents object, which represents a BigPanda incident containing correlated alerts from your integrated monitoring systems.
Incident_identifier
During alert correlation, BigPanda assigns correlated events an incident identifier. This ID is used throughout the BigPanda system to recognize whether two events are related to each other. Incident identifiers are created from the tags and event data sent to BigPanda for each event. By default, the incident identifier is a combination of the event’s host and check, but it could be other fields depending on the properties of the correlating alerts. The incident_identifier may also be called the incident_key.
NOTE: incident_identifier is a reserved system word within BigPanda and cannot be changed or redefined for use in custom enrichment. When sending incident_identifier fields to BigPanda, ensure that incident_identifier is lowercase only.
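One way to picture an identifier derived from host and check is to hash those two fields together, so that events for the same issue always map to the same value. How BigPanda actually computes the identifier is not stated here; the function below, its use of SHA-256, and the separator are purely illustrative assumptions.

```python
import hashlib


def incident_identifier(event, fields=("host", "check")):
    """Derive a stable identifier from selected event fields
    (hypothetical scheme, shown with the default host and check)."""
    parts = [str(event.get(name, "")) for name in fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


warning = {"host": "web-1", "check": "cpu_load", "status": "warning"}
critical = {"host": "web-1", "check": "cpu_load", "status": "critical"}
other = {"host": "web-1", "check": "disk_io", "status": "warning"}

same_issue = incident_identifier(warning) == incident_identifier(critical)
different_issue = incident_identifier(warning) != incident_identifier(other)
```

Because status is not part of the key, the warning and critical events share an identifier, while the event for a different check does not.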
Machine Learning
Machine learning is an important element of artificial intelligence. Machine learning focuses on the ability of a program to develop and progress in a specific task without using explicit instructions, allowing the program to rely on patterns and inference instead. Machine learning allows a machine or program to develop and create a solution on its own once limitations and standards are set, rather than simply following programming.
BigPanda’s Open Box Machine Learning combines the power of AI with transparency and customization through “explainable AI”. With BigPanda Open Box Machine Learning, the logic is explained to IT Operations teams in plain English. Teams can then edit this logic to add situational and tribal knowledge to strengthen it on their own, without requiring expert data scientists. From there, teams can test and run what-if experiments on real live production data to make sure their changes work as intended, before deploying them, promoting higher trust and adoption of machine learning throughout the organization.
The BigPanda Machine Learning Engine runs during alert correlation to suggest patterns that may improve correlation and during root cause analysis to highlight potential root causes of incidents.
MTTR
Mean time to repair/resolve (MTTR) is a maintenance metric that measures the average time required to troubleshoot and repair failed systems and equipment.
BigPanda’s AIOps combines your best-of-breed monitoring tools with automation, a single pane view, and collaborative streamlining to shorten your incident management lifecycle and dramatically improve your MTTR.
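As a worked example of the metric, MTTR is the total time spent resolving incidents divided by the number of incidents. The sketch below assumes each incident is recorded as a (detected, resolved) pair of timestamps; the function name and sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average the time between detection and resolution across incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_resolve(incidents)
```

For these two incidents the durations are 30 and 90 minutes, so the MTTR is 60 minutes.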