Research Orientation on AIOps for University Students

I am currently a second-year university student researching the application of artificial intelligence (AI) in cloud computing, for tasks such as traffic prediction and error forecasting. However, after reading numerous papers in search of research gaps, I have found very few new or interesting directions to specialize in. I would greatly appreciate feedback and perspectives from experts. Thank you sincerely.


I’m not very familiar with AIOps, so for now, I’ll just use GPT to summarize the current trends in research:


There are still good directions in 2026.

The important shift is this: AIOps is no longer most interesting at the level of “predict one metric better.” The stronger questions now are about diagnosis, decision support, realistic evaluation, and operating newer systems such as LLM and GPU infrastructure. Recent benchmarks and surveys show that the field has moved from narrow detection toward multimodal reasoning over logs, metrics, traces, incident knowledge, and operator workflows. (arXiv)

Why you feel stuck

You started from a common student entry point: traffic prediction and error forecasting. That is logical. It is measurable, easy to prototype, and easy to find papers on. The problem is that those tasks are often supporting tasks, not the main operational objective. The cloud AIOps survey frames the real goals around detection, failure prediction, root-cause analysis, and actions that reduce mean time to detect (MTTD) and mean time to repair (MTTR), while OpenTelemetry frames observability around answering “Why is this happening?” using traces, metrics, and logs together. That makes isolated prediction feel narrower than it first appears. (arXiv)

There is a second reason. Some parts of the literature are simply crowded. A survey of AIOps projects found that monitoring data such as logs and performance metrics are the most common inputs and that anomaly detection is the most common goal. That means a lot of visible work is concentrated in the same few problem formulations. If you keep searching inside that corridor, it will look saturated. (Heng Li)

There is a third reason. Older benchmarks made some problems look easier than they really are. RCAEval already had to introduce a benchmark with 735 failure cases and 15 reproducible baselines for microservice RCA, and newer benchmarks such as OpenRCA and Cloud-OpsBench exist because the community still thinks evaluation is incomplete and too far from realistic incident work. Fields do not keep building benchmarks this quickly when the problems are finished. (arXiv)

What changed by 2026

By 2026, two trends are very clear.

First, the field is becoming multimodal and agentic. Cloud-OpsBench argues that modern RCA should be evaluated as active reasoning rather than passive classification, and it introduces 452 fault cases across 40 root-cause types over the Kubernetes stack. OpenRCA does something similar from the LLM side, with 335 failures and over 68 GB of logs, metrics, and traces. The message is that the research frontier is now closer to “investigate like an operator” than “classify one snapshot.” (arXiv)

Second, LLM-based AIOps is real enough to study seriously, but still weak enough to leave major room for research. OpenRCA reports that even with a specially designed RCA-agent, the best-performing model solved only 11.34% of failure cases. A 2026 failure-analysis paper then ran 1,675 agent runs across five models and found recurring pitfalls such as hallucinated data interpretation and incomplete exploration. That is not a solved area. It is an immature area with visible failure modes. (OpenReview)

At the same time, the base layer has not changed: traces, metrics, and logs still matter, and context propagation still determines whether those signals can be tied together correctly in a distributed system. That means students who understand observability well are still positioned better than students who only understand modeling. (OpenTelemetry)

The best way to think about specialization

Do not ask:

What else can I predict?

Ask:

What decision in operations is still badly supported?

That one change in framing usually reveals better research topics.

A good specialization in AIOps today usually sits at one of these boundaries:

  • between observability and diagnosis
  • between diagnosis and operator action
  • between benchmarks and real incidents
  • between classical cloud systems and new AI infrastructure (OpenTelemetry)

Where the strongest opportunities are today

1. Multimodal root-cause analysis under incomplete telemetry

This is the best direction for most students.

The 2023 survey on AIOps for cloud platforms explicitly says there are very limited efforts on trace and multimodal failure prediction. The 2024 failure-diagnosis survey then frames microservice diagnosis as a multimodal problem involving logs, metrics, traces, events, and topology. OpenTelemetry’s official primer reinforces why this matters: each signal answers a different part of the debugging problem. (arXiv)

This creates a strong research question:

What can still be diagnosed when telemetry is missing, delayed, downsampled, or corrupted?

That question is strong because it is realistic. In real systems, traces are sampled, logs are noisy, metrics are delayed, and context propagation breaks. A method that works only under perfect observability is not very useful. (OpenTelemetry)

A thesis-sized version would be:

Robust RCA for cloud-native systems under partial observability

You would:

  • instrument a small microservice system
  • collect metrics, logs, and traces
  • inject several fault types
  • compare performance when one signal is missing or degraded

This is technically solid, experimentally manageable, and still underbuilt enough to matter. (arXiv)
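
To make the last step concrete, here is a minimal Python sketch of how telemetry degradation could be simulated before re-running an RCA baseline. The column names (`trace_id`, `timestamp`, `duration_ms`) and the surrounding experiment are assumptions for illustration, not part of any specific benchmark.

```python
import random
import pandas as pd

def degrade_traces(spans: pd.DataFrame, drop_rate: float, seed: int = 0) -> pd.DataFrame:
    """Simulate head-based sampling loss by dropping a fraction of whole traces."""
    rng = random.Random(seed)
    kept = {t for t in spans["trace_id"].unique() if rng.random() > drop_rate}
    return spans[spans["trace_id"].isin(kept)]

def delay_metrics(metrics: pd.DataFrame, delay_s: float) -> pd.DataFrame:
    """Simulate late-arriving metrics by shifting timestamps forward."""
    out = metrics.copy()
    out["timestamp"] = out["timestamp"] + delay_s
    return out

if __name__ == "__main__":
    # Toy data: 100 single-span traces. In the real experiment you would run the
    # same RCA baseline on the full and on the degraded telemetry, then compare
    # top-k accuracy per fault type and per drop rate.
    spans = pd.DataFrame({"trace_id": [f"t{i}" for i in range(100)],
                          "duration_ms": range(100)})
    print(len(degrade_traces(spans, drop_rate=0.5)["trace_id"].unique()))  # roughly 50 survive
```

The interesting research result is the shape of the curve: how quickly diagnosis quality collapses as each modality is removed or degraded.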

2. Benchmarking and evaluation, not just new models

This area is less glamorous than model-building, but often more valuable.

RCAEval, OpenRCA, and Cloud-OpsBench together show that the field still needs better evaluation infrastructure: RCAEval standardized reproducible RCA benchmarking for microservices, OpenRCA exposed how hard long-context, multimodal RCA is for LLMs, and Cloud-OpsBench moved the evaluation target toward active tool use and deterministic reproducibility. That combination says the benchmark layer is still under construction. (arXiv)

A good student contribution here does not need to be “invent a novel architecture.” It can be:

  • a better failure-injection setup
  • a clearer evaluation protocol
  • a partial-observability benchmark variant
  • a benchmark for drift or telemetry misalignment
  • a comparison of diagnosis quality versus data-collection cost

That kind of work is publishable because the community is still trying to measure the right things. (arXiv)
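
As one small example of what “a better failure-injection setup” can mean in practice, here is a sketch of a seeded, declarative fault schedule, so that every baseline is evaluated on exactly the same failure cases. The service names and fault types are placeholders, not taken from any of the cited benchmarks.

```python
import json
import random
from dataclasses import dataclass, asdict

@dataclass
class FaultCase:
    case_id: int
    target_service: str   # which microservice to perturb
    fault_type: str       # e.g. cpu_stress, network_delay, pod_kill
    start_s: int          # offset from experiment start
    duration_s: int

def build_schedule(services, fault_types, n_cases, seed=42):
    """Generate a reproducible fault-injection schedule from a fixed seed."""
    rng = random.Random(seed)
    return [FaultCase(case_id=i,
                      target_service=rng.choice(services),
                      fault_type=rng.choice(fault_types),
                      start_s=rng.randrange(0, 3600),
                      duration_s=rng.choice([60, 120, 300]))
            for i in range(n_cases)]

schedule = build_schedule(["frontend", "cart", "payment"],
                          ["cpu_stress", "network_delay", "pod_kill"], n_cases=5)
print(json.dumps([asdict(c) for c in schedule], indent=2))
```

Publishing the schedule alongside the telemetry is what makes the resulting benchmark reproducible for the next group.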

3. Incident reports, postmortems, and historical incident knowledge

This is one of the most underused directions for students who like both systems and language.

AutoARTS studied over 2,000 incidents from more than 450 Azure services to build a better root-cause labeling system, which shows how important and messy incident knowledge is in practice. The LLM-era AIOps survey also shows that newer systems increasingly incorporate human-generated artifacts such as Q&A, software information, and incident reports rather than relying only on raw telemetry. (USENIX)

That suggests several strong questions:

  • how to retrieve similar past incidents during triage
  • how to use postmortems to improve ranking of root-cause candidates
  • how to clean noisy incident labels
  • how to generate grounded summaries that point to evidence, not just fluent text

A strong project here would be:

Telemetry plus postmortem retrieval for incident triage

That is better than a generic “LLM for AIOps” project because it has grounding, evaluation, and immediate practical value. (USENIX)
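
A minimal baseline for the retrieval part can be as simple as lexical similarity over postmortem text, for example with scikit-learn. The incident texts below are invented, and a real system would use embeddings and would link each retrieved incident back to supporting telemetry.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of past postmortems; in practice these would be real incident reports.
postmortems = [
    "Checkout latency spike caused by connection pool exhaustion after deploy",
    "Payment failures due to expired TLS certificate on downstream API",
    "Node memory pressure led to OOM kills of the cart service",
]

new_incident = "cart pods restarting with out-of-memory errors during peak traffic"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(postmortems)
query_vec = vectorizer.transform([new_incident])

# Rank past incidents by similarity to the new one.
scores = cosine_similarity(query_vec, doc_matrix)[0]
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {postmortems[idx]}")
```

Even this trivial baseline gives you something measurable to beat, which is exactly what a grounded triage evaluation needs.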

4. Decision-aware forecasting instead of plain forecasting

If you still like forecasting, keep it, but reframe it.

Forecasting is mature enough that “slightly better prediction accuracy” is often not a compelling research story by itself. More interesting is whether prediction improves autoscaling, SLO management, capacity planning, or bottleneck ranking. A recent 2026 review of distributed tracing and proactive SLO management emphasizes evaluation protocols for SLO violation prediction and actionable outputs such as bottleneck candidate ranking and what-if estimation. Meanwhile, the AIOps model-update study shows that once deployed, models must be actively maintained because operational data evolve over time. (Frontiers)

So instead of:

  • “predict traffic better”

move to:

  • “predict enough, early enough, and robustly enough to support SLO-safe control”

A good topic would be:

Drift-aware workload prediction for SLO-constrained autoscaling

That keeps your current interests but upgrades the operational meaning of the work. (arXiv)
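
As a sketch of what “decision-aware” means, the snippet below scores a forecaster by the autoscaling decision it drives rather than by forecast error alone. The per-replica capacity and headroom factor are assumed constants for illustration, not values from any cited paper.

```python
import numpy as np

CAPACITY_PER_REPLICA = 100.0   # requests/s one replica can serve within the SLO (assumed)
HEADROOM = 1.2                 # over-provisioning factor to absorb forecast error

def replicas_needed(predicted_rps: float) -> int:
    """Turn a workload forecast into a scaling decision with SLO headroom."""
    return max(1, int(np.ceil(predicted_rps * HEADROOM / CAPACITY_PER_REPLICA)))

def evaluate(pred: np.ndarray, actual: np.ndarray) -> dict:
    """Decision-aware evaluation: SLO violations and cost, not only forecast error."""
    replicas = np.array([replicas_needed(p) for p in pred])
    violations = np.mean(actual > replicas * CAPACITY_PER_REPLICA)  # under-provisioned steps
    return {"mae": np.mean(np.abs(pred - actual)),
            "slo_violation_rate": violations,
            "avg_replicas": replicas.mean()}

# Toy example: a forecaster that lags the true workload by one step.
actual = np.array([200, 220, 400, 800, 750, 300], dtype=float)
pred = np.roll(actual, 1); pred[0] = actual[0]
print(evaluate(pred, actual))
```

The point of the exercise is that the same forecast error can be harmless or harmful depending on the decision it feeds, which is what makes the framing operationally meaningful.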

5. AIOps for LLM and GPU systems

This is the freshest niche in 2026.

A 2026 study on GPU-driven LLM workloads tested 24 RCA methods and found that existing RCA tools do not generalize to these systems; multi-source approaches did best, metric-based methods depended heavily on the fault type, and trace-based methods largely failed. That is a strong sign that classical web-service AIOps is not enough for modern AI-serving stacks. (arXiv)

This is a very good area if you have access to:

  • a lab running inference or training workloads
  • GPU cluster telemetry
  • a simulator or controlled deployment setup

A strong topic would be:

Failure diagnosis for LLM inference services using multi-source observability

This is genuinely current. It also has a clear argument for why old methods are insufficient. (arXiv)
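
If you do have GPU access, even the data-collection side is a useful starting point. The sketch below polls GPU utilization via nvidia-smi (assuming an NVIDIA host with the CLI installed) so it can later be joined with request-level latency from the inference server; the join itself is left out.

```python
import subprocess
import time

def sample_gpu_metrics():
    """One sample of GPU utilization and memory via nvidia-smi (assumes the CLI is installed)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=timestamp,index,utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    rows = [line.split(", ") for line in out.strip().splitlines()]
    return [{"ts": r[0], "gpu": int(r[1]), "util_pct": int(r[2]), "mem_mib": int(r[3])}
            for r in rows]

# Poll alongside request-level telemetry from the inference server, so GPU saturation
# can later be correlated with latency spikes and queueing behaviour per batch.
for _ in range(3):
    print(sample_gpu_metrics())
    time.sleep(5)
```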

6. Model maintenance, drift, and lifecycle reliability

This is not fashionable, but it is very real.

The 2023 model-update paper says directly that when and how to update AIOps models remains an under-investigated topic, and shows that active update strategies can outperform a stationary model in both performance and stability. If you want a topic that looks modest but teaches excellent research habits, this is one of the best. (arXiv)

This is especially relevant because operations data are not static. New deployments, new users, new software versions, new logging conventions, and new workloads all change the distribution. AIOps models that are good on day one and stale on day sixty are not good AIOps models. (arXiv)

A solid project here would be:

When should an RCA or forecasting model be retrained under workload drift?

That is practical, measurable, and deployment-relevant. (arXiv)
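
One simple retraining trigger you could study is a drift test on the deployed model's residuals. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the window sizes and threshold are chosen arbitrarily for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def should_retrain(reference: np.ndarray, recent: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Flag retraining when the recent error distribution has drifted
    significantly from the reference window (two-sample KS test)."""
    stat, p_value = ks_2samp(reference, recent)
    return p_value < p_threshold

# Toy example: residuals of a deployed forecaster, before and after a workload shift.
rng = np.random.default_rng(0)
reference_errors = rng.normal(0.0, 1.0, size=500)   # errors in the week after training
recent_errors = rng.normal(1.5, 1.0, size=200)      # errors after a new deployment
print(should_retrain(reference_errors, recent_errors))  # True: distribution shifted
```

The research question is then whether such triggered updates beat both “never retrain” and “retrain on a fixed schedule” in accuracy, stability, and retraining cost.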


What looks crowded today

These areas are not useless. They are just harder to make important unless you bring a strong twist.

Plain traffic prediction on standard traces

Too many papers stop at forecast accuracy. The more meaningful work now ties prediction to SLOs, control, or cost. (Frontiers)

Log-only anomaly detection or log parsing

There is still active work here, including LLM-based methods, but this is one of the busiest lanes in the literature. It is easier to write incremental papers here than durable ones. (ScienceDirect)

Toy RCA on simplified microservice setups

RCAEval, OpenRCA, and Cloud-OpsBench exist precisely because older evaluation setups were not enough. Purely toy results are less convincing now. (arXiv)

Generic “LLM for AIOps” demos

The benchmark evidence is still sobering. OpenRCA’s best RCA-agent result is 11.34%, and the 2026 failure-analysis work shows consistent agent failure patterns. The bar is now much higher than “I prompted an LLM on logs.” (OpenReview)


What I would recommend for you specifically

For a second-year university student, I would optimize for three things at once:

  1. high learning value
  2. feasible experiments
  3. a topic with room for a real contribution

That points me to this ranking.

Best overall choice

Robust multimodal RCA under incomplete telemetry
Why: strong gap, realistic, good systems training, publishable without giant resources. (arXiv)

Best if you like NLP or LLMs

Incident retrieval and grounded triage from postmortems plus telemetry
Why: strong link to practice, less crowded than generic log parsing, and more grounded than vague LLM demos. (USENIX)

Best if you want to keep forecasting

Decision-aware forecasting for autoscaling or SLO support, with drift handling
Why: preserves your current skills but makes the problem more meaningful. (Frontiers)

Best if you want a fresh niche

AIOps for LLM/GPU systems
Why: very current and not yet well served by existing methods. (arXiv)


What I would do in your position over the next year

Step 1. Build observability literacy first

Before choosing a thesis, get very comfortable with:

  • what traces are
  • what metrics are
  • what logs are
  • how context propagation connects them

That is the substrate of almost every serious AIOps problem today. Without it, many papers look more magical than they are. (OpenTelemetry)
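
If context propagation feels abstract, a few lines of OpenTelemetry's Python SDK make it concrete (assuming the opentelemetry-api and opentelemetry-sdk packages are installed; the span names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.propagate import inject

# Minimal tracer setup that prints spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout"):
    # Context propagation: the current trace context is injected into outgoing
    # headers, which is what lets a downstream service continue the same trace.
    headers = {}
    inject(headers)               # adds a W3C 'traceparent' header
    print("outgoing headers:", headers)

    with tracer.start_as_current_span("charge-card"):
        pass                      # child span: same trace_id, new span_id
```

The `traceparent` header printed here is exactly what ties signals from different services into one trace, and what breaks when propagation is misconfigured.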

Step 2. Reproduce one benchmark

Pick one:

  • RCAEval if you want classical RCA on microservices
  • OpenRCA if you want LLM-based or multimodal RCA
  • Cloud-OpsBench if you want agentic RCA and tool use (arXiv)

Your goal here is not to beat the benchmark immediately. It is to understand what the data, labels, and failure cases really look like.

Step 3. Add one realism constraint

Choose one:

  • missing traces
  • noisy logs
  • delayed metrics
  • drift
  • weak labels
  • retrieval from past incidents
  • abstention or confidence calibration

One realism constraint is enough for a strong undergraduate project. (arXiv)

Step 4. Evaluate the right outcome

Do not evaluate only MAE, F1, or Top-1.

Also ask (see the scoring sketch after this list):

  • did triage get faster?
  • did the top-k root-cause ranking improve?
  • how robust was the method when observability degraded?
  • how costly is the method to run?
  • does it know when not to answer? (arXiv)
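
A scoring sketch for the ranking and abstention questions might look like this; the record format and the convention that an abstaining method returns `None` are assumptions, and robustness would be measured by re-running the same scorer over degraded-telemetry variants.

```python
def evaluate_rankings(cases, k=3):
    """Decision-oriented RCA metrics: top-k hit rate plus how often the
    method abstained instead of guessing (hypothetical record format)."""
    hits, abstained = 0, 0
    for case in cases:
        if case["ranking"] is None:                       # method declined to answer
            abstained += 1
        elif case["true_root_cause"] in case["ranking"][:k]:
            hits += 1
    answered = len(cases) - abstained
    return {f"top{k}_accuracy": hits / answered if answered else 0.0,
            "abstention_rate": abstained / len(cases)}

cases = [
    {"true_root_cause": "cart-db", "ranking": ["cart-db", "frontend", "payment"]},
    {"true_root_cause": "payment", "ranking": ["frontend", "cart-db", "checkout"]},
    {"true_root_cause": "frontend", "ranking": None},     # low confidence: abstain
]
print(evaluate_rankings(cases))
```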

Step 5. Write the paper around the bottleneck, not the architecture

In AIOps, a valuable contribution can be:

  • a benchmark setup
  • a robustness study
  • a cleaner failure-injection protocol
  • a grounded triage pipeline
  • a realistic comparison under missing data

It does not need to be “a bigger model.” (arXiv)


Good topic statements you could actually use

These are all better than “AI for cloud traffic prediction.”

Safe and strong

Observability-driven root-cause analysis under partial telemetry in cloud-native systems (OpenTelemetry)

Strong if you like documents and LLMs

Grounded incident triage using telemetry and historical postmortems (USENIX)

Strong if you want systems plus forecasting

Drift-aware workload prediction for SLO-oriented autoscaling (arXiv)

Bold and current

Failure diagnosis for GPU-backed LLM inference services (arXiv)


Bottom line

Your problem is not that AIOps has no gaps left.

Your problem is that you started in one of the most crowded entry corridors.

Today, the field is more promising if you move from prediction-centric AIOps to decision-centric AIOps. The best opportunities are where observability is incomplete, labels are weak, evaluation is immature, or the systems themselves have changed faster than the diagnostic tools. That is exactly what the current surveys, benchmarks, and observability standards are telling us. (arXiv)

If I had to give you one direct recommendation, it would be this:

Specialize in observability-driven failure diagnosis for cloud systems under realistic constraints, especially incomplete telemetry.

That direction is current, technically serious, and still open enough for a second-year student to make a meaningful contribution. (arXiv)