Using Language Models for Threat Intelligence Triage at Scale -- T34ch Tech

The average threat intelligence team receives more reports in a week than it can read in a month. Vendor advisories, ISAC bulletins, open-source intelligence feeds, government alerts, blog posts from security researchers, indicators of compromise from sharing platforms -- the volume is not the problem anyone set out to solve, but it is the problem that now dominates the workflow. The actual analytical work -- determining relevance, mapping activity to techniques, identifying patterns, informing defensive decisions -- happens in the narrow window of time left over after triage.

Language models offer a potential solution to the volume problem. Not to the analysis problem -- that requires human judgment -- but to the initial triage: reading a report, determining what it describes, mapping the described behavior to a structured framework, and routing the result to the right analyst. This is classification work, and large language models are remarkably capable classifiers, even without task-specific training data.

This article covers the practical application of language models to threat intelligence triage. The focus is on zero-shot and few-shot classification against the MITRE ATT&CK framework, because that is the most common structured taxonomy in the field and the one most likely to be useful for downstream consumption. We will cover what works, what fails, the specific failure modes you should expect, and the architecture of a pipeline that is honest about the model's limitations.

The Volume Problem in Concrete Terms

To understand why automated triage matters, consider the numbers. A mid-sized threat intelligence team -- four to six analysts -- might subscribe to ten to fifteen intelligence feeds. Each feed produces between five and fifty reports per day. That is 50 to 750 reports daily, or roughly 250 to 3,750 per working week. Each report ranges from a paragraph-length indicator bulletin to a 40-page APT campaign analysis.

An experienced analyst can meaningfully triage a report in two to ten minutes, depending on length and complexity. Call it five minutes on average. At 250 reports per week, that is roughly 21 hours of triage -- more than half of one analyst's working week consumed entirely by reading and sorting, before any actual analysis begins.

At the upper end of that range, 3,750 reports per week, triage alone would take roughly 312 hours and exceed the capacity of the entire team. Reports get skimmed, deprioritized by source reputation rather than content, or simply ignored. The reports that get ignored are not necessarily the least important ones. They are the ones that arrived when the queue was already full.
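The arithmetic is easy to check. The sketch below uses the five-minute average from the text plus one assumption that is not stated above: a 40-hour working week per analyst.

```python
MINUTES_PER_REPORT = 5        # average triage time used in the text
HOURS_PER_ANALYST_WEEK = 40   # assumption: standard working week

def triage_hours(reports_per_week: int) -> float:
    """Hours of triage needed for a given weekly report volume."""
    return reports_per_week * MINUTES_PER_REPORT / 60

for weekly_reports in (250, 750, 3750):
    hours = triage_hours(weekly_reports)
    analysts = hours / HOURS_PER_ANALYST_WEEK
    print(f"{weekly_reports:>5} reports/week -> {hours:6.1f} h (~{analysts:.1f} analysts)")
```

Under these assumptions, 250 reports per week is about 21 hours of triage, and 3,750 per week is about 312 hours, which is more than six analysts can supply in a week.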

This is the gap that automated triage addresses. Not replacing the analyst's judgment about what matters, but handling the mechanical classification work fast enough that the analyst's time goes to analysis rather than sorting.

Zero-Shot Classification: The Core Technique

Traditional text classification requires labeled training data: hundreds or thousands of examples of each category, annotated by humans, used to train a model that can then classify new text into those categories. Building that training corpus for threat intelligence is expensive, domain-specific, and perpetually out of date as new techniques emerge.

Key term: Zero-shot classification. A classification approach in which the model receives no labeled training examples for the target categories. Instead, the model uses its pre-trained understanding of language to match input text against category descriptions provided at inference time. In the context of ATT&CK mapping, this means giving the model a threat report and a list of technique descriptions, and asking it to identify which techniques the report describes -- without ever training the model on labeled ATT&CK-mapped reports.

Zero-shot classification sidesteps the training data problem entirely. Instead of learning the mapping from examples, the model uses its pre-existing understanding of language and its knowledge of cybersecurity concepts (absorbed during pre-training on internet-scale text) to perform the classification at inference time.

The practical mechanism depends on the model architecture. For generative models (GPT-4, Claude, Llama), you construct a prompt that presents the report text and asks the model to identify relevant ATT&CK techniques. For embedding models (sentence-transformers, text-embedding-ada-002), you encode both the report text and the technique descriptions into vector representations and measure similarity.

Both approaches work. They fail in different ways, and the choice between them depends on your latency requirements, cost constraints, and tolerance for specific error types.

Generative Approach: Prompting for Classification

The generative approach treats classification as a structured generation task. You provide the model with a report and ask it to produce a structured output identifying the ATT&CK techniques described in the report.

A minimal prompt looks something like this:

You are a threat intelligence analyst. Read the following report and identify all MITRE ATT&CK techniques described in the text. For each technique, provide the technique ID, the technique name, and a one-sentence justification citing the specific behavior in the report that maps to that technique. If the report does not describe any recognizable ATT&CK techniques, say so.

Report: [report text]

Output format: JSON array of objects with fields: technique_id, technique_name, justification

This produces usable results out of the box for well-written, specific threat reports. A report that says "the actor used a scheduled task to maintain persistence" will reliably map to T1053.005. A report that describes "credential dumping via LSASS memory access" will map to T1003.001. The model has seen enough cybersecurity text during pre-training to make these connections without explicit examples.
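As a rough sketch, the prompt can be built from a template and the reply parsed defensively. The model call itself is deliberately left out because it depends on your provider. The names `build_prompt`, `parse_response`, and `sample_reply` are illustrative, and the template asks for an empty array when nothing matches (rather than free text) so that the reply always parses as JSON.

```python
import json

PROMPT_TEMPLATE = """You are a threat intelligence analyst. Read the following report and identify all MITRE ATT&CK techniques described in the text. For each technique, provide the technique ID, the technique name, and a one-sentence justification citing the specific behavior in the report that maps to that technique. If the report does not describe any recognizable ATT&CK techniques, return an empty array.

Report: {report}

Output format: JSON array of objects with fields: technique_id, technique_name, justification"""

REQUIRED_FIELDS = {"technique_id", "technique_name", "justification"}

def build_prompt(report_text: str) -> str:
    """Insert the report text into the classification prompt."""
    return PROMPT_TEMPLATE.format(report=report_text)

def parse_response(raw: str) -> list[dict]:
    """Parse the model's reply; return [] if it is not the JSON we asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only objects that carry every field the prompt requested.
    return [item for item in data
            if isinstance(item, dict) and REQUIRED_FIELDS.issubset(item)]

# Illustrative reply, written by hand rather than produced by a model.
sample_reply = (
    '[{"technique_id": "T1053.005", "technique_name": "Scheduled Task", '
    '"justification": "The actor used a scheduled task to maintain persistence."}]'
)
labels = parse_response(sample_reply)
```

Treating unparseable output as "no labels" is one possible choice. Depending on your queue, you may prefer to retry the call or route the report to manual triage instead.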

Embedding Approach: Semantic Similarity

The embedding approach skips generation entirely. You pre-compute vector embeddings for every ATT&CK technique description. At triage time, you embed the report (or chunks of it) and compute cosine similarity against the technique embeddings. Techniques above a similarity threshold are candidate matches.

This is faster, cheaper, and deterministic -- the same input always produces the same output. It also fails more gracefully: instead of hallucinating a technique ID that does not exist, it returns the nearest real technique in embedding space, even if the similarity score is low. The score itself serves as a confidence indicator.

The tradeoff is granularity. Embedding similarity captures semantic relatedness but does not distinguish between a report that describes a technique and a report that mentions a technique in passing. A report about lateral movement that briefly references credential theft to provide context will produce non-trivial similarity scores for both the lateral movement techniques and the credential access techniques. The generative approach, prompted correctly, can distinguish between "the report describes this technique" and "the report mentions this technique." The embedding approach cannot.
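The similarity step itself is small. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings, which typically have hundreds of dimensions and would come from whichever embedding model you choose. The vectors, the 0.70 threshold, and the function names are assumptions for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for pre-computed technique embeddings (assumption).
technique_vectors = {
    "T1053.005": [0.9, 0.1, 0.0],   # Scheduled Task
    "T1003.001": [0.1, 0.9, 0.1],   # LSASS Memory
    "T1021.001": [0.0, 0.2, 0.9],   # Remote Desktop Protocol
}

def candidate_techniques(report_vector: list[float], threshold: float = 0.70):
    """Return (technique_id, score) pairs at or above the threshold, best first."""
    scored = [(tid, cosine_similarity(report_vector, vec))
              for tid, vec in technique_vectors.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

report_vector = [0.8, 0.3, 0.1]     # toy embedding of a report (assumption)
candidates = candidate_techniques(report_vector)
```

With these toy vectors only T1053.005 clears the threshold. Note that the score says nothing about whether the report describes the technique or merely mentions it, which is the granularity limit discussed above.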

Fig. 01 -- Generative vs. embedding classification pipelines

Two approaches to LLM-based ATT&CK mapping. Generative models produce richer output but risk hallucinated technique IDs. Embedding models are faster and deterministic but cannot distinguish between techniques a report describes in detail and techniques it mentions in passing. Both should feed an analyst review queue, not a production system directly.

Prompt Engineering for ATT&CK Classification

Prompt design has more impact on classification quality than model selection. A well-prompted smaller model consistently outperforms a poorly prompted larger model on this task. The key design decisions are specificity of instruction, output format constraints, and whether to provide technique descriptions or rely on the model's pre-trained knowledge.

Providing the Taxonomy vs. Relying on Pre-Training

You have two options. First, you can include ATT&CK technique descriptions in the prompt, asking the model to match the report against the provided descriptions. Second, you can rely on the model's pre-trained knowledge of ATT&CK, providing only the instruction to identify techniques.

The first approach is more reliable but hits context window limits quickly. ATT&CK Enterprise has over 200 techniques and 400 sub-techniques. Their descriptions total roughly 150,000 tokens. Many models cannot hold the full taxonomy and a substantial report in a single context window, and those with larger windows still pay for every one of those tokens in cost and latency on each call. In practice you must either pre-filter the taxonomy (using embeddings to select the 15-25 most likely techniques before prompting) or decompose the task across multiple calls.

The second approach fits in a single prompt but introduces a failure mode: the model may reference technique IDs from memory that are outdated, misnumbered, or entirely fabricated. Models trained before a particular ATT&CK update will not know about techniques added in that update. Models with imprecise memorization of the taxonomy will confidently produce IDs that are close to correct but wrong -- T1059.001 instead of T1059.003, for example. This is not a rare edge case. In testing, roughly 8-15% of technique IDs produced by generative models without taxonomy grounding contain errors in the sub-technique number.

Output Format and Validation

Always request structured output (JSON) and always validate it against the current ATT&CK database before passing it downstream. The validation step catches hallucinated technique IDs, malformed output, and version mismatches. It takes milliseconds and prevents the most damaging class of errors.

A proper validation layer does three things: confirms each technique ID exists in the current ATT&CK version, confirms the technique name matches the ID (catching cases where the model produced a valid ID with the wrong name), and flags any output where the model expressed uncertainty. The third check requires the prompt to instruct the model to indicate confidence, which most models will do if asked explicitly.
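A minimal sketch of those three checks follows. The four-entry `ATTACK_TECHNIQUES` table is a tiny illustrative excerpt; a real deployment would load the full current ATT&CK release. The `confidence` field is an assumption: it only exists if your prompt asks the model to produce it.

```python
import re

# Illustrative excerpt only; load the full current ATT&CK release in practice.
ATTACK_TECHNIQUES = {
    "T1003.001": "LSASS Memory",
    "T1053.005": "Scheduled Task",
    "T1059.001": "PowerShell",
    "T1059.003": "Windows Command Shell",
}
ID_PATTERN = re.compile(r"T\d{4}(\.\d{3})?")

def validate(label: dict) -> list[str]:
    """Return a list of problems; an empty list means the label passed."""
    problems = []
    tid = str(label.get("technique_id") or "")
    name = str(label.get("technique_name") or "")
    if not ID_PATTERN.fullmatch(tid):
        problems.append("malformed technique ID")
    elif tid not in ATTACK_TECHNIQUES:
        problems.append("unknown technique ID")          # check 1: ID exists
    elif name.strip().lower() != ATTACK_TECHNIQUES[tid].lower():
        problems.append("name does not match ID")        # check 2: name matches
    # Check 3: hypothetical "confidence" field requested in the prompt.
    if label.get("confidence", "high") != "high":
        problems.append("model flagged uncertainty")
    return problems
```

Returning a list of problems rather than a boolean lets the review interface show the analyst why a label was flagged, and lets you log rejected IDs for later prompt refinement.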

Few-Shot Prompting: When It Helps

Few-shot prompting -- providing two to five examples of correct classifications in the prompt before presenting the new report -- improves consistency significantly. It does not improve the model's knowledge of ATT&CK (the model either knows the techniques or it does not), but it calibrates the output format, the level of justification detail, and the threshold for what constitutes a match versus a mention.

The examples should cover the failure modes you care about most: a report that describes multiple techniques, a report that mentions techniques without describing them in operational detail, a short indicator bulletin with minimal context, and a long campaign analysis with many techniques interleaved. Four well-chosen examples have more impact than twenty mediocre ones.
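One way to assemble such a prompt is sketched below. The two examples are invented for illustration (one clear match, one mention without operational detail); in practice you would use four or so drawn from real reports. `INSTRUCTION`, `FEW_SHOT_EXAMPLES`, and `build_few_shot_prompt` are hypothetical names.

```python
import json

INSTRUCTION = (
    "Identify the MITRE ATT&CK techniques described in the report. "
    "Return a JSON array of objects with fields: "
    "technique_id, technique_name, justification."
)

# Invented examples for illustration only.
FEW_SHOT_EXAMPLES = [
    {   # a technique described in operational detail
        "report": "The actor created a scheduled task to relaunch the implant at logon.",
        "labels": [{
            "technique_id": "T1053.005",
            "technique_name": "Scheduled Task",
            "justification": "A scheduled task relaunches the implant at logon.",
        }],
    },
    {   # a technique mentioned in passing, which should not be labeled
        "report": "Prior reporting linked this group to credential dumping, "
                  "but none was observed in this intrusion.",
        "labels": [],
    },
]

def build_few_shot_prompt(report: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Instruction, then worked examples, then the new report to classify."""
    parts = [INSTRUCTION, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} report: {ex['report']}",
            f"Example {i} output: {json.dumps(ex['labels'])}",
            "",
        ]
    parts += [f"Report: {report}", "Output:"]
    return "\n".join(parts)
```

The second example is the one that does the calibration work: it shows the model that a mention without observed behavior should produce an empty result.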

Key concept: Prompt sensitivity in classification tasks. Small changes in prompt wording produce measurably different classification results. The phrase "identify all ATT&CK techniques described in this report" produces different results than "list the ATT&CK techniques that this report provides evidence for." The first tends toward over-classification (listing techniques that are mentioned but not described). The second tends toward under-classification (omitting techniques that are clearly implied but not named). Test your prompt against a held-out set of manually classified reports and measure precision and recall before deploying.

Precision, Recall, and the Triage Tradeoff

Any classification system makes two types of errors. False positives: labeling a report with a technique it does not actually describe. False negatives: failing to label a report with a technique it does describe. The relative cost of these errors determines how you should tune the system.

In threat intelligence triage, the cost asymmetry is clear. A false positive means an analyst spends two minutes reviewing a technique mapping that turns out to be wrong -- a minor time cost. A false negative means the analyst never sees a relevant technique mapping, potentially missing a connection between the report and an active threat to their environment.

This means you should tune for high recall at the expense of precision. It is better to surface too many candidate techniques and let the analyst filter than to miss relevant techniques silently. In practice, targeting 85-90% recall with 60-70% precision produces a workload the analyst can manage -- roughly one in three technique labels needs correction, but very few relevant labels are missed.
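Measuring where you sit on that tradeoff only requires a held-out set of analyst-labeled reports. The sketch below computes micro-averaged precision and recall over (predicted, actual) label sets; the two-report `held_out` sample is invented for illustration.

```python
def micro_precision_recall(pairs):
    """pairs: iterable of (predicted_ids, actual_ids), each a set of technique IDs.

    Precision = correct labels / labels the model produced.
    Recall    = correct labels / labels the analysts assigned.
    """
    pairs = list(pairs)
    true_pos = sum(len(pred & actual) for pred, actual in pairs)
    n_pred = sum(len(pred) for pred, _ in pairs)
    n_actual = sum(len(actual) for _, actual in pairs)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall

# Invented held-out sample: (model labels, analyst labels) per report.
held_out = [
    ({"T1053.005", "T1003.001", "T1059.001"}, {"T1053.005", "T1003.001"}),
    ({"T1566"}, {"T1566", "T1078"}),
]
precision, recall = micro_precision_recall(held_out)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Rerunning this after each prompt or threshold change shows whether you moved toward the high-recall operating point or away from it.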

Fig. 02 -- Precision-recall tradeoff for triage classification

For triage, recall matters more than precision. A missed technique is invisible to the analyst. A false positive costs two minutes of review. Tune your system toward the high-recall end of the PR curve: accept more false positives to minimize false negatives. The exact operating point depends on your team's capacity and the downstream cost of missed intelligence.

Failure Modes: Where the Technique Breaks Down

Every system fails. The value of understanding failure modes is that you can design around them. LLM-based classification has several predictable failure modes, and each has a different mitigation.

Hallucinated Technique IDs

This is the most discussed failure mode and the easiest to mitigate. Generative models occasionally produce technique IDs that do not exist in ATT&CK. Sometimes the ID is structurally valid (Txxxx.xxx format) but refers to no real technique. Sometimes the model invents a plausible-sounding technique name and assigns it a fabricated ID.

Mitigation: Validate every technique ID against the current ATT&CK database. Reject any ID that does not resolve. Log rejected IDs -- they often indicate that the model identified a real behavior but could not map it correctly, which is useful feedback for prompt refinement.

Sub-Technique Confusion

Models frequently assign the correct parent technique but the wrong sub-technique. This is partially a precision problem (T1059 is correct but T1059.001 versus T1059.003 matters operationally) and partially a training data problem (the model's pre-training data may not distinguish sub-techniques consistently).

Mitigation: For triage purposes, consider scoring at the parent technique level and flagging sub-technique assignments as tentative. An analyst can quickly confirm T1059.003 (Windows Command Shell) versus T1059.001 (PowerShell) given the parent technique is correct. Requiring the model to get the sub-technique right without review is asking for more precision than the application requires.
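Parent-level scoring can be sketched in a few lines. The status strings and function names here are illustrative, not part of any standard.

```python
def parent_id(technique_id: str) -> str:
    """T1059.003 -> T1059; a parent ID is returned unchanged."""
    return technique_id.split(".", 1)[0]

def score_at_parent_level(predicted, actual) -> dict[str, str]:
    """Mark each predicted label as exact, parent-only (tentative), or no match."""
    actual = set(actual)
    actual_parents = {parent_id(t) for t in actual}
    result = {}
    for tid in predicted:
        if tid in actual:
            result[tid] = "exact"
        elif parent_id(tid) in actual_parents:
            result[tid] = "parent match, sub-technique tentative"
        else:
            result[tid] = "no match"
    return result

# The model said PowerShell; the analyst's label was Windows Command Shell.
print(score_at_parent_level(["T1059.001"], ["T1059.003"]))
```

A label in the middle category is one the analyst can usually confirm or fix at a glance, so it may be reasonable to count it as a partial success when measuring the system.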

Context Window Limits and Long Reports

A 40-page APT campaign analysis may exceed the model's context window. Even models with 128K-token windows struggle with very long inputs because attention degrades over distance -- information in the middle of a long document tends to receive less attention than information at the beginning and end. This is a widely reported behavior of current transformer-based models.

Mitigation: Chunk long reports into sections and classify each section independently. Aggregate technique labels across sections and deduplicate. This loses cross-section context (the model cannot reason about relationships between findings in different sections) but produces more reliable per-section classification. For the cross-section reasoning, you need an analyst.
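A simple version of chunk-then-aggregate is sketched below. It splits on blank lines and packs paragraphs up to a character budget; the 4,000-character default is an arbitrary assumption, and `classify` stands for whatever per-chunk classifier you use.

```python
from typing import Callable

def chunk_report(text: str, max_chars: int = 4000) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def classify_long_report(
    text: str,
    classify: Callable[[str], list[str]],
    max_chars: int = 4000,
) -> list[str]:
    """Classify each chunk independently, then merge and deduplicate labels."""
    merged: list[str] = []
    for chunk in chunk_report(text, max_chars):
        for tid in classify(chunk):
            if tid not in merged:      # keep first-seen order
                merged.append(tid)
    return merged
```

Splitting on the report's own section headings, where they exist, would likely preserve more context per chunk than a character budget does.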

Ambiguous Language and Implied Techniques

Not all threat reports use precise technical language. A report might say "the actor obtained valid credentials" without specifying whether this was phishing (T1566), credential stuffing (T1110.004), or purchase from an initial access broker (T1589.001). The model must either pick one (risking a wrong classification), pick all plausible options (reducing precision), or flag the ambiguity (adding analyst workload).

The best approach is the third: flag ambiguous mappings explicitly. A prompt instruction like "if the report does not provide enough detail to distinguish between related techniques, list all plausible candidates and note the ambiguity" produces output that is honest about uncertainty. Analysts can resolve the ambiguity in seconds using their domain knowledge. The model cannot, and pretending it can produces confident wrong answers.

Temporal and Version Drift

ATT&CK is versioned. Techniques are added, deprecated, merged, and renumbered across versions. A model trained on text that references ATT&CK v12 will produce different IDs than a model that has seen v14. If your validation layer uses the current version and the model produces IDs from an older version, valid classifications will be incorrectly rejected.

Mitigation: Maintain a mapping table across ATT&CK versions. When a technique ID fails validation against the current version, check whether it maps to a deprecated or renumbered technique and translate automatically. This is a small engineering investment that eliminates a large class of false rejections.
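The lookup can be as simple as a dictionary consulted when validation fails. The two legacy entries below reflect the renumbering that accompanied the introduction of sub-techniques, but treat the whole table as illustrative and build yours from MITRE's published version changelogs. `CURRENT_IDS` is a tiny stand-in for the full current release.

```python
# Stand-in for the full set of IDs in the current ATT&CK release (assumption).
CURRENT_IDS = {"T1059.001", "T1566.001", "T1053.005"}

# Illustrative legacy-to-current entries; verify against MITRE's changelogs.
LEGACY_TO_CURRENT = {
    "T1086": "T1059.001",   # PowerShell, pre-sub-technique numbering
    "T1193": "T1566.001",   # Spearphishing Attachment, pre-sub-technique numbering
}

def resolve(technique_id: str):
    """Return (resolved_id, status) where status is current, translated, or rejected."""
    if technique_id in CURRENT_IDS:
        return technique_id, "current"
    if technique_id in LEGACY_TO_CURRENT:
        return LEGACY_TO_CURRENT[technique_id], "translated"
    return None, "rejected"

for tid in ("T1059.001", "T1086", "T9999"):
    print(tid, "->", resolve(tid))
```

Keeping the "translated" status visible in the output is a small design choice that tells the analyst the model was working from an older numbering.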

Key concept: Confident wrong answers are worse than uncertain right answers. The most dangerous failure mode of LLM classification is not hallucination or missed techniques. It is confident misclassification that passes validation. If the model labels a report with T1078 (Valid Accounts) when the actual technique is T1133 (External Remote Services), and the ID is valid and the justification sounds plausible, the analyst may not catch the error. Design your system to surface uncertainty rather than suppress it. A classification labeled "low confidence" gets checked. A classification labeled "T1078" with a plausible justification gets accepted.

Pipeline Architecture for Production Use

A triage pipeline that uses LLMs in production needs more than a model and a prompt. It needs ingestion, preprocessing, classification, validation, routing, and feedback loops. Each stage has specific design requirements.

Fig. 03 -- End-to-end threat intel triage pipeline

A production triage pipeline uses embeddings for fast candidate selection and generative models for detailed classification against the narrowed candidate set. This two-stage approach keeps cost and latency manageable while preserving classification quality. The feedback loop from analyst corrections is essential -- without it, the system cannot improve.

The Two-Stage Architecture

The most effective production architecture uses both approaches in sequence. The embedding model runs first as a fast, cheap filter. It identifies the 15-25 most likely ATT&CK techniques for a given report based on semantic similarity. This reduces the classification problem from 600+ techniques and sub-techniques to a manageable candidate set.

The generative model runs second, receiving only the report and the candidate techniques. This dramatically reduces the context window requirement (the full descriptions of 20 techniques fit comfortably in any modern model's context) and focuses the model's attention on distinguishing between related techniques rather than searching the entire taxonomy.

This two-stage approach cuts API costs by roughly 80% compared to sending the full taxonomy to the generative model, reduces latency to under 10 seconds per report, and improves classification accuracy because the generative model is working with a focused candidate set rather than an overwhelming taxonomy.
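The control flow of the two stages can be sketched independently of any particular model. Here `rank_candidates` stands for the embedding stage and `refine` for the generative stage; both are hypothetical callables you would supply, and the stubs at the bottom exist only to show the data flow.

```python
from typing import Callable

def two_stage_classify(
    report: str,
    rank_candidates: Callable[[str], list[tuple[str, float]]],
    refine: Callable[[str, list[str]], list[str]],
    top_k: int = 20,
) -> list[str]:
    """Stage 1 narrows the taxonomy; stage 2 classifies against the shortlist."""
    ranked = sorted(rank_candidates(report), key=lambda pair: pair[1], reverse=True)
    shortlist = [tid for tid, _ in ranked[:top_k]]
    labels = refine(report, shortlist)
    # Design choice (assumption): discard any ID the generative step returns
    # that was not on the shortlist, since it was never offered as an option.
    return [tid for tid in labels if tid in shortlist]

# Stubs standing in for real models.
def fake_rank(report: str) -> list[tuple[str, float]]:
    return [("T1003.001", 0.41), ("T1053.005", 0.88), ("T1021.001", 0.12)]

def fake_refine(report: str, shortlist: list[str]) -> list[str]:
    return ["T1053.005", "T9999"]    # second ID simulates a hallucination

result = two_stage_classify("report text", fake_rank, fake_refine, top_k=2)
```

One consequence of this design is that stage 1 caps recall: a technique that misses the shortlist cannot be recovered by stage 2, so `top_k` is worth tuning generously.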

The Feedback Loop

Every analyst correction is training data. When an analyst changes a technique label, removes a false positive, or adds a missed technique, that correction should flow back into the system. Not as model fine-tuning (which is expensive and introduces its own problems), but as few-shot examples in the prompt and as calibration data for embedding thresholds.

Over time, the few-shot examples in your prompt should be drawn from your own corrected classifications, not from generic examples. Your organization's threat landscape, your analysts' standards for what constitutes a match, and your specific intelligence requirements are all reflected in the corrections. Using them as examples teaches the model your team's classification norms.

Human-in-the-Loop Design

The phrase "human in the loop" gets used loosely in ML system design. In the context of threat intelligence triage, it has a specific meaning: the LLM performs the initial classification, and a human analyst reviews, corrects, and approves the classification before it enters the production intelligence database.

This is not optional. It is a design requirement.

The reason is not that the model is unreliable -- though it is, in the ways described above. The reason is that threat intelligence triage is not purely a classification problem. It is a relevance problem. Whether a report describes T1059.001 is a factual question the model can answer reasonably well. Whether that classification matters to your organization -- given your technology stack, your threat model, your current defensive priorities -- is a judgment that requires organizational context the model does not have.

A report about a campaign targeting Kubernetes environments is irrelevant to an organization that runs no Kubernetes. A report about a technique you have already detected and mitigated is lower priority than a report about a technique you have no coverage for. A report from a source with a history of inaccurate attribution needs more scrutiny than one from a trusted ISAC partner. None of these relevance judgments can be automated, because they depend on context that changes daily and lives in the analyst's head, not in the model's training data.

The practical design implication is that the LLM output should be presented as a draft classification that the analyst confirms, not as a finished product. The interface should make it easy to accept, modify, or reject each technique label. And the system should track the acceptance rate -- the fraction of model labels the analyst accepts without modification -- as a primary quality metric. If the acceptance rate drops below 60%, the prompt or the model needs attention. If it exceeds 90%, you are probably over-tuning toward the analyst's biases and should audit for blind spots.
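Tracking the metric takes very little code. The sketch below assumes each model label ends up with one of three review outcomes ("accepted", "modified", "rejected"); those outcome names and the message strings are illustrative, while the 60% and 90% bounds come from the text above.

```python
def acceptance_rate(outcomes: list[str]) -> float:
    """Fraction of model labels the analyst accepted without modification."""
    if not outcomes:
        return 0.0
    return outcomes.count("accepted") / len(outcomes)

def health_check(rate: float) -> str:
    """Map the acceptance rate to the action suggested in the text."""
    if rate < 0.60:
        return "review prompt or model"
    if rate > 0.90:
        return "audit for blind spots"
    return "ok"

# Invented week of review outcomes: 7 accepted, 2 modified, 1 rejected.
week = ["accepted"] * 7 + ["modified"] * 2 + ["rejected"]
rate = acceptance_rate(week)
print(f"acceptance rate {rate:.0%}: {health_check(rate)}")
```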

What Language Models Cannot Replace

It is worth being explicit about what this approach does not automate, because the temptation to expand scope is strong.

Attribution. Determining which threat actor is responsible for described activity requires cross-referencing infrastructure, tooling, victimology, and operational patterns across multiple reports and historical data. LLMs can summarize what a report says about attribution. They cannot independently assess whether the attribution is credible, and they will confidently repeat incorrect attributions from the source text.

Novelty detection. Identifying genuinely new techniques or previously unseen tool variants requires understanding what is already known -- the full corpus of prior intelligence. A model classifying a single report cannot determine whether a technique is novel to the threat landscape. It can only determine whether the technique matches something in ATT&CK. If the technique is genuinely new, it will not match, and the model will either force-fit it to an existing technique or report no match. Neither output communicates "this is new and important."

Strategic intelligence. Trend analysis, threat landscape assessment, and predictive intelligence require synthesis across many reports over time. LLMs can summarize individual reports. They cannot synthesize a quarter's worth of intelligence into a strategic assessment, because that synthesis requires an analytical framework, institutional knowledge, and judgment about what matters that is not derivable from the text alone.

Source evaluation. Assessing the reliability and credibility of an intelligence source is a meta-analytical task that depends on the analyst's experience with that source over time. A model cannot evaluate whether a source that has been accurate for two years has recently become unreliable due to a compromise or a change in methodology.

These limitations are not temporary gaps that better models will close. They are structural consequences of what classification is and what analysis is. Classification maps text to categories. Analysis synthesizes information, applies judgment, and produces assessments that go beyond what any individual input contains. The former is automatable. The latter is not.

LLMs are effective at the mechanical layer of threat intelligence triage: reading reports, extracting described behaviors, and mapping those behaviors to structured taxonomies like ATT&CK. They reduce the volume problem enough that analysts can spend their time on analysis rather than sorting. They do not perform analysis. The pipeline architecture that works in practice is two-stage classification (embedding filter, then generative refinement), with mandatory human review, a validation layer that catches hallucinations, and a feedback loop that improves the system over time using analyst corrections. Build for recall over precision, surface uncertainty rather than suppress it, and never let model output enter a production database without analyst approval.

Getting Started: A Minimal Viable Pipeline

If you are building this for the first time, start simple. Do not build the full two-stage architecture on day one. Build the minimal version that delivers value, validate it against your analysts' manual classifications, and expand incrementally.

The minimal pipeline has four components: an ingestion script that pulls reports from your primary feed and extracts plain text, an embedding model that scores each report against the ATT&CK technique descriptions, a threshold filter that selects techniques above a cosine similarity of 0.70, and a simple web interface that presents the report alongside its candidate techniques for analyst review.

That is enough to start measuring. Track acceptance rate, track the techniques the model misses that analysts add, track the false positives analysts remove. Those metrics tell you whether the system is useful and where it needs improvement. After two weeks of corrections, you have enough data to add a generative refinement step with few-shot examples drawn from your own corrected classifications.
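The two correction metrics can be tallied per technique from the review log. This sketch assumes the log can be reduced to pairs of label sets (what the model proposed, what the analyst kept); the sample data is invented.

```python
from collections import Counter

def tally_corrections(reviews):
    """reviews: iterable of (model_labels, analyst_labels), each a set of IDs.

    Returns two Counters: techniques analysts added (model misses) and
    techniques analysts removed (model false positives).
    """
    missed, false_positives = Counter(), Counter()
    for model_labels, analyst_labels in reviews:
        missed.update(analyst_labels - model_labels)
        false_positives.update(model_labels - analyst_labels)
    return missed, false_positives

# Invented review log.
reviews = [
    ({"T1059.001", "T1078"}, {"T1059.001", "T1133"}),
    ({"T1078"}, {"T1133"}),
]
missed, false_positives = tally_corrections(reviews)
print("most missed:", missed.most_common(3))
print("most removed:", false_positives.most_common(3))
```

Techniques that appear at the top of either list are natural candidates for the few-shot examples you add when you introduce the generative step.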

Do not optimize for the perfect system. Optimize for a system that saves your analysts time today and gets measurably better next month. The volume problem is not going away. The question is whether your team spends its capacity on sorting or on analysis. Even a mediocre automated triage system shifts that balance in the right direction, and a mediocre system with a feedback loop becomes a good system faster than you expect.

Fig. 04 -- Minimum viable pipeline components

Start with the minimum that delivers value: ingest, embed, threshold, review. Measure for two weeks. Then add generative refinement using your own corrections as few-shot examples. The feedback loop is what turns a mediocre MVP into a reliable system.

The threat intelligence volume problem is real, it is getting worse, and it is not going to be solved by hiring more analysts. Language models offer a practical path to automating the classification layer -- the mechanical work of reading, extracting, and mapping -- so that human analysts can focus on the judgment layer: relevance, novelty, strategic significance, and defensive action. That division of labor is the right one. The model reads. The analyst thinks. Neither is sufficient alone. Together, they handle a workload that neither could manage independently.