Threat Hunting with ML and NLP -- T34ch Tech

Signature-based detection works until it does not. A hash changes, a domain rotates, an attacker rewrites their implant, and the rule that caught yesterday's campaign is useless against today's variant. This is not a failure of the individual signature. It is a failure of the detection model. Signatures encode what happened. They do not encode what is happening.

Machine learning offers a different premise: instead of describing known-bad patterns exactly, describe normal behavior statistically and flag deviations. Natural language processing extends that premise to the vast body of unstructured threat intelligence that humans produce but cannot search at scale. Together, they open a detection surface that signatures cannot reach.

This article is about building that detection surface with real tools, real data, and realistic expectations about what ML can and cannot do in a security operations context. If you are looking for a vendor pitch about AI-powered security, you are in the wrong place. If you are looking for practical guidance on turning security telemetry into features, features into models, and models into actionable alerts, read on.

The Detection Gap

The core problem is straightforward. Adversaries who know your detection stack can evade it. If your detections are signatures -- YARA rules, Snort rules, hash blocklists, static IOC feeds -- an attacker only needs to change the artifact. The behavior stays the same. The implementation changes. Your rule misses.

Living-off-the-land techniques make this worse. When an attacker uses powershell.exe, wmic.exe, certutil.exe, or mshta.exe to accomplish their objectives, there is no malicious binary to hash. The tool is legitimate. The usage is not. Distinguishing legitimate from illegitimate usage of a system binary requires understanding context: who ran it, when, from where, with what arguments, following what prior activity. That is a behavioral question, not a signature question.

The MITRE ATT&CK framework catalogs over 200 techniques across 14 tactics. Many of those techniques have multiple sub-techniques. Writing and maintaining signatures for every known implementation of every technique is not feasible. Writing signatures for unknown implementations is definitionally impossible. This is the detection gap: the space between what your signatures cover and what adversaries actually do.

Key term: Living off the land (LotL). Adversary techniques that use tools already present on the target system or routinely found in enterprise environments -- PowerShell, WMI, PsExec, native OS utilities -- rather than dropping custom malware. LotL techniques generate process execution events that look similar to legitimate administrative activity, making signature-based detection unreliable without behavioral context.

ML does not close the detection gap entirely. Nothing does. But it moves detection from pattern matching to pattern recognition: from "alert if you see exactly this" to "alert if this deviates from what is expected." That shift is the foundation of everything that follows.

Behavioral Baselining

Before you can detect anomalies, you need to know what normal looks like. This sounds obvious. In practice, it is the hardest part of the entire pipeline.

"Normal" in a network is not static. It shifts with business cycles, software deployments, personnel changes, seasonal patterns, and a hundred other factors. The baseline you built in January may not describe February. The baseline for the engineering team does not describe the finance team. The baseline for weekdays does not describe weekends.

A useful behavioral baseline has three properties. First, it is scoped: it describes a specific population (a host, a user, a department, a service account) over a specific time window. Second, it is statistical: it captures distributions, not just averages. Third, it is versioned: you know when it was built, from what data, and when it expires.
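As a minimal sketch of those three properties, the record below stores a baseline as percentiles for one population and stamps it with build and expiry times. The class name, field names, the 30-day lifetime, and the sample values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import quantiles


@dataclass(frozen=True)
class Baseline:
    """A scoped, statistical, versioned baseline for one metric."""
    population: str          # scope: host, user, department, or service account
    metric: str              # what was measured
    window_start: datetime   # versioning: the data the baseline was built from
    window_end: datetime
    built_at: datetime
    expires_at: datetime
    percentiles: dict        # statistical: distribution cut points, not just a mean

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at


def build_baseline(population, metric, values, window_start, window_end,
                   ttl_days=30):
    """Summarize observed values as percentiles and stamp the result."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points: p1..p99
    built_at = datetime.now(timezone.utc)
    return Baseline(
        population=population,
        metric=metric,
        window_start=window_start,
        window_end=window_end,
        built_at=built_at,
        expires_at=built_at + timedelta(days=ttl_days),  # assumed lifetime
        percentiles={p: cuts[p - 1] for p in (5, 50, 95, 99)},
    )


# Placeholder data standing in for three months of hourly login counts.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 3, 31, tzinfo=timezone.utc)
baseline = build_baseline("dept:finance", "logins_per_hour",
                          list(range(101)), start, end)
```

Storing percentiles rather than a single mean lets a later check ask "is this above the 99th percentile for this population?" instead of "is this above average?"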

What to Baseline

Not all telemetry is equally useful for baselining. The signal-to-noise ratio varies enormously across data sources. In practice, these are the telemetry sources that produce the most useful baselines for threat hunting:

Process execution logs. Parent-child process relationships, command-line arguments, execution frequency by user and host. A process tree that has never appeared on a given host before is interesting. A command-line argument pattern that appears on one host but not its peers is interesting.

Authentication events. Login times, source IPs, authentication types, failed attempt rates, service account usage patterns. An account that normally authenticates from three workstations during business hours and suddenly authenticates from a server at 3 AM warrants investigation.

DNS queries. Query volume per host, domain age distribution, character entropy of queried domains, TXT record query frequency. DNS is the most reliably available telemetry source in most environments and one of the most useful for detecting C2, data exfiltration, and initial access.

Network flow data. Bytes transferred per connection, connection duration, port usage, internal-to-external ratio, beaconing periodicity. NetFlow is low-fidelity compared to full packet capture but high-coverage and cheap to store.

Fig. 01 -- Telemetry sources and baseline value

Not all telemetry is created equal. Process execution, authentication events, and DNS queries sit in the highest-value quadrant: strong behavioral signal and broad availability. Full packet capture has the richest signal but is rarely stored at scale.

The Drift Problem

Baselines decay. A model trained on three months of authentication data from Q1 will see Q2's new hires, reorganized teams, and migrated services as anomalies. This is not a bug in the model. It is a feature of the environment. Environments change.

The practical consequence is that baselines require maintenance. You need a retraining cadence, a mechanism to detect when the baseline has drifted enough to produce excessive false positives, and a process for incorporating validated anomalies back into the baseline as "new normal."

Most teams that fail with ML-based detection fail here. Not because the model was bad. Because the model was good once and nobody maintained it.

Feature Engineering for Security Data

Raw logs are not features. A Sysmon event with 30 fields is not 30 features. Feature engineering is the process of transforming raw telemetry into numerical representations that capture the behavioral signals a model can learn from. In security data, this is where domain expertise earns its keep.

Process Execution Features

Consider a process creation event. The raw fields include parent process name, child process name, command-line arguments, user, timestamp, and working directory. Useful features extracted from these fields include:

Parent-child rarity score. How often has this specific parent-child process pair appeared in the baseline? explorer.exe spawning chrome.exe is common. wmiprvse.exe spawning powershell.exe is less so. winword.exe spawning cmd.exe is worth investigating.

Command-line entropy. Encoded payloads, base64 strings, and randomized filenames produce high-entropy command lines. Normal administrative commands tend to have lower entropy. Shannon entropy of the command-line string is a cheap, effective feature.

Execution hour deviation. The distance between this event's timestamp and the user's typical execution times for this process. A developer running git at 2 PM is unremarkable. A finance user running net.exe at 2 AM is a different conversation.

Argument length percentile. Where does this command line fall in the length distribution for its process? Unusually long command lines often indicate encoded payloads, LOLBin abuse, or scripted automation that departs from interactive use.
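Three of the process features above can be sketched with the standard library alone. The function names are hypothetical, and the rarity score shown here (surprisal of an add-one-smoothed pair frequency) is one reasonable definition among several, not the only way to measure rarity.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pair_rarity(parent: str, child: str, baseline_pairs: Counter) -> float:
    """Surprisal (in bits) of a parent-child pair; higher means rarer.

    Add-one smoothing keeps never-seen pairs finite.
    """
    total = sum(baseline_pairs.values())
    count = baseline_pairs[(parent.lower(), child.lower())]
    return -math.log2((count + 1) / (total + 1))


def hour_deviation(event_hour: int, typical_hour: int) -> int:
    """Distance in hours on a 24-hour clock (23:00 and 02:00 are 3 apart)."""
    diff = abs(event_hour - typical_hour) % 24
    return min(diff, 24 - diff)


# Illustrative baseline counts, not real telemetry.
pairs = Counter({("explorer.exe", "chrome.exe"): 900,
                 ("wmiprvse.exe", "powershell.exe"): 5})
common = pair_rarity("explorer.exe", "chrome.exe", pairs)
unseen = pair_rarity("winword.exe", "cmd.exe", pairs)
```

The circular distance in `hour_deviation` matters: a plain subtraction would treat 23:00 and 01:00 as 22 hours apart rather than 2.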

DNS Features

DNS is a rich source of features for both anomaly detection and direct threat identification:

Domain character entropy. Domains produced by a domain generation algorithm (DGA) have measurably higher character entropy than human-registered domains. A domain like kj3x9qm2vb.net is statistically distinguishable from docs.google.com.

Query rate deviation. A host that normally makes 200 DNS queries per hour suddenly making 2,000 may be performing reconnaissance, exfiltrating data via DNS tunneling, or communicating with a DGA-based C2.

TXT record ratio. TXT record queries are normal but typically comprise a small fraction of total DNS traffic. A significant increase in TXT queries from a single host is a common indicator of DNS-based data exfiltration.

Domain age at first query. Newly registered domains queried by internal hosts are disproportionately associated with malicious activity. This feature requires enrichment from WHOIS or passive DNS data, but it is one of the most reliable signals in DNS-based detection.
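Query rate deviation and TXT record ratio are both simple per-host computations. The sketch below assumes hourly query counts per host are already available; the function names and the sample history are illustrative.

```python
from statistics import mean, stdev


def rate_zscore(current: float, history: list) -> float:
    """Standard deviations between the current hourly count and the host's history.

    Returns 0.0 when there is too little history or no variance to compare
    against; a production version would want to handle those cases explicitly.
    """
    if len(history) < 2:
        return 0.0
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (current - mean(history)) / sigma


def txt_ratio(query_types: list) -> float:
    """Fraction of a host's queries in a window that asked for TXT records."""
    if not query_types:
        return 0.0
    return sum(1 for q in query_types if q.upper() == "TXT") / len(query_types)


# A host that normally makes about 200 queries per hour.
history = [190, 200, 210, 195, 205]
spike = rate_zscore(2000, history)
quiet = rate_zscore(200, history)
```

A z-score assumes the history is roughly symmetric; for heavy-tailed query counts, a percentile or median-based measure may behave better.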

Key term: Feature engineering. The process of transforming raw data fields into numerical representations that capture the behavioral signals relevant to a detection task. In security ML, feature engineering is where analyst domain knowledge -- understanding what makes activity suspicious -- gets encoded into a form that statistical models can work with. A model is only as good as its features.

Supervised vs. Unsupervised Approaches

The choice between supervised and unsupervised learning is not a philosophical preference. It is determined by what data you have.

Supervised learning requires labeled data: examples of both benign and malicious activity, tagged as such. If you have a corpus of labeled malware samples, a dataset of confirmed phishing emails, or a collection of verified attack chains from past incidents, supervised methods can learn the decision boundary between classes. Supervised models tell you: "this looks like the malicious examples I was trained on."

Unsupervised learning does not require labels. It learns the structure of the data itself and flags observations that deviate from that structure. If you do not have labeled attack data -- and for most novel threat detection, you do not -- unsupervised methods are your entry point. Unsupervised models tell you: "this looks different from everything else."

The practical landscape looks like this: use supervised methods when you have reliable labels and are looking for known categories of threats. Use unsupervised methods when you are hunting for unknowns or when labeled data is scarce or unreliable. Most production security ML pipelines use both.

Fig. 02 -- Supervised vs. unsupervised detection approaches

Supervised methods need labels but deliver precise classification. Unsupervised methods need only a baseline but produce noisier output. Production systems combine both, using unsupervised anomaly detection to surface candidates and supervised classifiers to triage them.

TensorFlow and Keras for Security

The model architectures that matter for security ML are not exotic. You do not need transformers with billions of parameters. You need architectures matched to the structure of your data and the nature of your detection task. Three architectures cover the vast majority of practical security use cases.

Autoencoders for Anomaly Detection

An autoencoder is a neural network that learns to compress input data into a lower-dimensional representation and then reconstruct it. Train the autoencoder on normal data only. At inference time, feed it new observations. If the reconstruction error is low, the observation is similar to the training distribution -- normal. If reconstruction error is high, the observation is dissimilar -- anomalous.

This architecture is well-suited to security because it solves the label problem. You do not need examples of attacks. You need examples of normal operations, which you have in abundance. The autoencoder learns the manifold of normal behavior. Anything off that manifold gets a high reconstruction error.

In practice, a dense autoencoder with 3-5 layers, trained on feature vectors from process execution or authentication logs, produces useful anomaly scores with modest data and compute requirements. The encoder compresses your feature vector (say, 50 dimensions of process execution features) down to a bottleneck of 8-12 dimensions, and the decoder reconstructs the original 50. The reconstruction loss on each observation is your anomaly score.

A practical Keras implementation looks like this:

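from tensorflow import keras

input_dim = 50     # feature vector length
encoding_dim = 10  # bottleneck size

# Encoder: compress the feature vector down to the bottleneck.
encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(encoding_dim, activation='relu'),
])
# Decoder: reconstruct the original vector. The sigmoid output assumes
# every feature has been scaled to [0, 1] (for example, min-max scaling).
decoder = keras.Sequential([
    keras.Input(shape=(encoding_dim,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(input_dim, activation='sigmoid'),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')

# X_normal: array of shape (n_samples, input_dim) built from known-normal
# activity only. Input and target are the same array.
autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=256,
                validation_split=0.1, shuffle=True)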

At inference time, compute the mean squared error between input and reconstruction. Set your alerting threshold at a percentile of the training reconstruction errors -- the 99th percentile is a common starting point, tuned based on your false positive tolerance.
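The thresholding step can be sketched independently of the model. The helpers below are hypothetical names; they take plain lists of floats, so they apply to reconstructions from any autoencoder. The nearest-rank percentile is one of several percentile definitions and is used here only because it is easy to verify by hand.

```python
import math


def reconstruction_error(original, reconstructed):
    """Mean squared error between one feature vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def percentile_threshold(errors, pct=99.0):
    """Nearest-rank percentile of reconstruction errors on normal data."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_anomalous(original, reconstructed, threshold):
    """True when the reconstruction error exceeds the alerting threshold."""
    return reconstruction_error(original, reconstructed) > threshold


# Placeholder errors standing in for scores on known-normal observations.
normal_errors = [i / 1000 for i in range(1, 101)]
threshold = percentile_threshold(normal_errors, pct=99.0)
```

Computing the threshold on held-out normal data rather than the exact rows the model trained on tends to give a more honest estimate, since a network usually reconstructs its own training rows slightly better than unseen ones.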

LSTMs for Sequence Modeling

Many security-relevant behaviors are sequential. A process chain is a sequence. A series of authentication events is a sequence. An attacker's lateral movement path is a sequence. Recurrent architectures like LSTMs capture temporal dependencies that feed-forward networks miss.

The practical application is process chain modeling. Train an LSTM on sequences of process creation events (encoded as integer tokens or embedding vectors) from normal activity. The model learns to predict the next process in a chain given the preceding processes. At inference time, if the model assigns low probability to the observed next process, that transition is anomalous.

For example, a model trained on normal workstation activity learns that outlook.exe -> chrome.exe is common and explorer.exe -> notepad.exe is common. When it sees winword.exe -> cmd.exe -> powershell.exe -> whoami.exe, the probability drops at each transition. The cumulative low probability flags the chain for review.
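The scoring idea (low next-step probability means an anomalous transition) can be shown without a neural network. The sketch below swaps the LSTM for a first-order transition-count model, which only looks one process back. It is a simplified stand-in to illustrate the scoring, not an LSTM, and the training chains are made up.

```python
import math
from collections import Counter, defaultdict


class TransitionModel:
    """First-order (bigram) model of process transitions with add-one smoothing."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, chains):
        for chain in chains:
            self.vocab.update(chain)
            for prev, nxt in zip(chain, chain[1:]):
                self.counts[prev][nxt] += 1
        return self

    def prob(self, prev, nxt):
        seen = self.counts.get(prev, Counter())
        # The extra +1 in the denominator reserves mass for never-seen processes.
        return (seen[nxt] + 1) / (sum(seen.values()) + len(self.vocab) + 1)

    def chain_log_prob(self, chain):
        """Sum of log-probabilities over every transition in the chain."""
        return sum(math.log(self.prob(p, n)) for p, n in zip(chain, chain[1:]))


# Illustrative "normal" workstation activity.
normal_chains = ([["explorer.exe", "chrome.exe"]] * 50
                 + [["outlook.exe", "chrome.exe"]] * 30
                 + [["explorer.exe", "notepad.exe"]] * 20)
model = TransitionModel().fit(normal_chains)

suspicious = ["winword.exe", "cmd.exe", "powershell.exe", "whoami.exe"]
suspicious_score = model.chain_log_prob(suspicious)
normal_score = model.chain_log_prob(["explorer.exe", "chrome.exe"])
```

Because log-probabilities add up, longer chains score lower even when every step is ordinary; dividing by the number of transitions is a common way to compare chains of different lengths.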

CNNs for Malware Image Classification

This technique treats binary files as images. Map the raw bytes of an executable into a 2D grayscale image (each byte value becomes a pixel intensity). Different malware families produce visually distinct textures because their code structure, packing methods, and embedded resources differ systematically.

A CNN trained on these malware images can classify malware into families, detect packing, and identify structural similarities between samples that share no common hash or signature. This approach was pioneered by Nataraj et al. and has been reproduced with datasets like BODMAS, Malimg, and Microsoft BIG 2015.

The architecture is standard image classification: two or three convolutional layers with max pooling, followed by dense layers and a softmax output. The unusual part is the input representation, not the model.
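The input representation is easy to sketch on its own. The helper names and the fixed width of 256 below are assumptions for illustration; published work varies the width with file size. The PGM writer is included only so the result can be opened in an image viewer.

```python
def bytes_to_grayscale(data: bytes, width: int = 256):
    """Reshape raw bytes into rows of `width` pixels, each an intensity 0-255.

    The final partial row is zero-padded so every row has the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


def write_pgm(rows, path):
    """Save non-empty pixel rows as a binary PGM (P5) image for inspection."""
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as fh:
        fh.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            fh.write(bytes(row))


# Ten placeholder bytes laid out four pixels per row.
image = bytes_to_grayscale(bytes(range(10)), width=4)
```

In a real pipeline the rows would be resized to a fixed height and width before being fed to the CNN, since executables differ in length.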

Remember: The model architecture is rarely the hard part. Feature engineering, data quality, and operational integration are where security ML projects succeed or fail. A simple autoencoder with well-engineered features will usually outperform a complex transformer with poor features.

NLP Applied to Threat Intelligence

Security teams drown in unstructured text. Threat intelligence reports, vendor advisories, blog posts, OSINT feeds, incident summaries -- all written in natural language, all containing actionable information locked behind prose. NLP provides the tools to extract structure from that text at scale.

IOC Extraction

The simplest NLP application in security is extracting indicators of compromise from unstructured reports. IP addresses, domain names, file hashes, URLs, CVE identifiers, and email addresses can be extracted with regular expressions. Named entity recognition (NER) models handle the harder cases: malware family names, threat actor names, targeted industries, and technique descriptions.

Tools like ioc-finder, iocextract, and spaCy with custom NER models handle this task. The practical challenge is not extraction -- it is deduplication, defanging reversal, context preservation, and confidence scoring. An IP address mentioned as a sinkhole is not the same as an IP address mentioned as C2 infrastructure. Context-aware extraction requires either rule-based post-processing or a classification layer on top of the NER output.
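A minimal sketch of the regex layer, including defanging reversal and deduplication, looks like the following. The patterns are deliberately simple assumptions written for this example; they are not the patterns used by ioc-finder or iocextract, and the naive domain pattern will also match file names such as setup.exe, so real use needs a TLD allowlist or similar filter.

```python
import re

IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")
SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")
CVE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b", re.IGNORECASE)


def refang(text: str) -> str:
    """Reverse common defanging conventions such as hxxp:// and [.]"""
    text = re.sub(r"hxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")


def extract_iocs(report: str) -> dict:
    """Return deduplicated, sorted indicators grouped by type."""
    clean = refang(report)
    return {
        "ipv4": sorted(set(IPV4.findall(clean))),
        "sha256": sorted({h.lower() for h in SHA256.findall(clean)}),
        "cve": sorted({c.upper() for c in CVE.findall(clean)}),
        "domain": sorted({d.lower() for d in DOMAIN.findall(clean)}),
    }


# Made-up report text using documentation-range addresses.
report = ("Beacons to hxxps://update.evil-cdn[.]com/gate and "
          "203[.]0[.]113[.]7; exploits cve-2021-44228.")
iocs = extract_iocs(report)
```

Note what this does not do: it records that an indicator appeared, not whether the report called it a sinkhole, a victim, or C2 infrastructure. That context step is where the rule-based post-processing or classifier comes in.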

Technique Classification Against MITRE ATT&CK

Given a paragraph describing adversary behavior, which ATT&CK technique does it describe? This is a text classification task. Supervised approaches work well here because the ATT&CK framework provides a taxonomy with descriptions and examples that can be used to generate training data.

A fine-tuned BERT model or a sentence transformer with a classification head, trained on ATT&CK technique descriptions augmented with real-world report excerpts, can map new report text to technique IDs with reasonable accuracy. The practical value is automated tagging: every report that enters your threat intelligence platform (TIP) gets tagged with the techniques it describes, enabling structured search across unstructured sources.

The TRAM (Threat Report ATT&CK Mapper) project from MITRE's Center for Threat-Informed Defense is an open-source implementation of exactly this approach. It uses sentence-level classification to map report text to ATT&CK techniques and is a reasonable starting point for teams building this capability.

Automated Report Summarization

A 40-page APT report contains perhaps two pages of information relevant to your specific environment. Extractive summarization -- selecting the most informative sentences -- helps analysts triage reports faster. Abstractive summarization -- generating new summary text -- is riskier in a security context because hallucinated details can create false leads, but it is useful for generating executive-level summaries from technical reports.

The practical approach is extractive summarization with a relevance filter tuned to your environment. Score sentences by their relevance to your industry, your technology stack, and your threat model. Present the top-scoring sentences as a summary. Let the analyst decide whether to read the full report.
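One simple way to implement that relevance filter is a weighted term match. The sketch below is an assumption-heavy illustration: the sentence splitter is naive, the weights are invented, and a real deployment might score with embeddings instead of substring matches.

```python
import re


def split_sentences(text: str):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence: str, weights: dict) -> float:
    """Sum the weights of environment-relevant terms found in the sentence."""
    lowered = sentence.lower()
    return sum(w for term, w in weights.items() if term in lowered)


def summarize(text: str, weights: dict, top_n: int = 3):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], weights),
                    reverse=True)
    keep = sorted(i for i in ranked[:top_n]
                  if score_sentence(sentences[i], weights) > 0)
    return [sentences[i] for i in keep]


# Hypothetical weights reflecting one organization's stack and sector.
weights = {"jenkins": 3.0, "active directory": 2.0, "healthcare": 1.0}
report = ("The group has been active since 2019. "
          "Initial access relied on exposed Jenkins servers. "
          "Operators then dumped Active Directory credentials. "
          "Victims span retail and logistics.")
summary = summarize(report, weights, top_n=2)
```

Because the output is made only of sentences copied from the report, nothing in the summary can be hallucinated, which is the property that makes the extractive approach safer for analyst triage.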

Key term: Zero-shot classification. A technique where a model classifies text into categories it was not explicitly trained on. Instead of training on labeled examples for each category, the model uses its general language understanding to match text against category descriptions. Applied to security: classify a threat report as "ransomware," "espionage," or "supply chain compromise" by providing those labels as prompts, without training examples for any of them.

Zero-Shot Classification

Zero-shot classification using large language models is the most accessible entry point for teams that lack labeled training data. The premise: instead of training a classifier from scratch, use a pre-trained language model's existing knowledge to classify text against categories you define at inference time.

In practice, this means using a model like facebook/bart-large-mnli via the Hugging Face transformers pipeline, or prompting an LLM API with a classification task. You provide the text (a threat report excerpt, an alert description, a log message) and a set of candidate labels ("credential theft," "lateral movement," "data exfiltration," "benign"). The model returns a probability distribution over the labels.

This approach has real utility and real limits. The utility: you can prototype a classification system in hours rather than weeks. No training data collection. No model training. No GPU time. The limits: accuracy is lower than a fine-tuned supervised model, the model has no knowledge of your specific environment, and prompt engineering becomes a critical skill that is harder to systematize than training data curation.

For threat report triage -- classifying incoming reports by relevance, severity, or threat category -- zero-shot classification is a reasonable production approach. For high-stakes detection decisions, it is a prototyping tool, not a production system.

Prompt Design for Security Classification

The quality of zero-shot classification depends heavily on how you frame the task. Vague labels produce vague results. Specific, descriptive labels produce better classification. "Malicious" vs. "benign" is a poor framing. "This text describes an attacker using stolen credentials to access a system they are not authorized to use" vs. "This text describes routine system administration activity" is a better framing.

Include domain context in the prompt. A generic language model does not know that certutil -urlcache is suspicious. If your classification depends on domain-specific knowledge, provide that knowledge in the prompt or switch to a fine-tuned model that has it.

Building a Detection Pipeline

Individual models are components. A detection pipeline is the system that turns raw telemetry into analyst-ready alerts. The pipeline has six stages, and failure at any stage defeats the entire system.

Fig. 03 -- ML detection pipeline architecture

The six-stage detection pipeline. Stages 1-4 are automated. Stage 5 is human. Stage 6 closes the loop -- analyst dispositions become training data for the next model iteration. Without stage 6, the pipeline degrades over time.

Stage 1: Data ingestion. Normalize heterogeneous log formats into a common schema. Align timestamps across sources. Deduplicate events that appear in multiple collection paths. This is plumbing work. It is also where most pipeline failures originate. A model trained on one timestamp format and fed another will produce garbage.

Stage 2: Feature extraction. Transform normalized events into feature vectors. This is the stage where your domain expertise lives. Precompute expensive features (rolling averages, peer group statistics) in batch. Compute cheap features (entropy, length, time-of-day) in real time.

Stage 3: Model inference. Run feature vectors through trained models. In production, this means serving models via TensorFlow Serving, TorchServe, or a simpler Flask/FastAPI wrapper. Throughput matters here: if the model scores events more slowly than they arrive, you build a backlog. For most security ML workloads, inference is fast enough that a single GPU or even a CPU serves adequately.

Stage 4: Alert generation. Apply thresholds to model output, enrich with contextual data (asset criticality, user role, historical alert frequency for this entity), deduplicate, and generate tickets. This stage is where the precision/recall tradeoff becomes concrete.

Stage 5: Analyst review. A human looks at the alert and decides: true positive, false positive, or needs further investigation. This is not optional. ML without human review is an alarm system with no one listening.

Stage 6: Feedback loop. Analyst dispositions flow back into the pipeline. True positives confirm the model is working and can become supervised training examples. False positives identify where the model is noisy and where thresholds or features need adjustment. Without this stage, your pipeline is a static artifact that will degrade as the environment changes.
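The Stage 1 plumbing is worth a concrete sketch, since timestamp mismatches are such a common failure. The helpers below normalize epoch numbers and ISO 8601 strings to a single UTC format and drop duplicate events. The field names (timestamp, host, event_id) and the rule that offset-less strings are treated as UTC are assumptions for this example.

```python
from datetime import datetime, timezone


def to_utc_iso(value) -> str:
    """Normalize an epoch number or ISO 8601 string to a UTC ISO 8601 string.

    Assumption for this sketch: strings without an offset are already UTC.
    """
    if isinstance(value, (int, float)):
        parsed = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = value.strip()
        if text.endswith(("Z", "z")):
            text = text[:-1] + "+00:00"  # older fromisoformat rejects a bare Z
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat(timespec="seconds")


def dedupe(events, key_fields=("timestamp", "host", "event_id")):
    """Drop events already seen via another collection path (first one wins)."""
    seen, unique = set(), []
    for event in events:
        key = tuple(event.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# The same event arriving from two collectors in two timestamp formats.
raw = [
    {"timestamp": "2024-03-01T12:00:00+02:00", "host": "ws-01", "event_id": 7},
    {"timestamp": "2024-03-01T10:00:00Z", "host": "ws-01", "event_id": 7},
]
normalized = [{**e, "timestamp": to_utc_iso(e["timestamp"])} for e in raw]
events = dedupe(normalized)
```

Order matters: deduplication only works after normalization, because the two raw records above differ as strings even though they describe the same instant.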

False Positive Management

This is where most security ML projects die. Not because the model is inaccurate in a statistical sense, but because the operational cost of false positives exceeds the analyst capacity to handle them.

Consider the arithmetic. Your environment generates 10 million security-relevant events per day. Your model has a 99% true negative rate and a 95% true positive rate. Those sound like excellent numbers. They are not. A 1% false positive rate on 10 million events is 100,000 false alerts per day. Even if you only surface the highest-confidence ones, you are burying your SOC.

The denominator problem is fundamental to security ML. In any real environment, the base rate of genuinely malicious activity is extremely low relative to total event volume. Even a very good model produces more false positives than true positives in absolute numbers because the class distribution is so skewed.

Remember: A model with a 1% false positive rate, applied to data where 0.01% of events are malicious, produces roughly 100 false positives for every true positive. Precision -- the fraction of alerts that are true positives -- is the metric that determines whether your SOC can operate. Accuracy is nearly meaningless for security classification at scale.
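The base-rate arithmetic is worth running once with your own numbers. The sketch below uses the figures from this section: 10 million events per day, 0.01% of them malicious, a 95% true positive rate, and a 1% false positive rate. The function name is hypothetical.

```python
def alert_breakdown(total_events, malicious_fraction, tpr, fpr):
    """Expected true positives, false positives, and precision for a base rate."""
    malicious = total_events * malicious_fraction
    benign = total_events - malicious
    true_pos = malicious * tpr
    false_pos = benign * fpr
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


tp, fp, precision = alert_breakdown(
    total_events=10_000_000,
    malicious_fraction=0.0001,  # 0.01% of events are malicious
    tpr=0.95,                   # true positive rate
    fpr=0.01,                   # 1 minus the 99% true negative rate
)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"precision:       {precision:.2%}")
```

With these inputs the model raises roughly 950 true alerts against about 99,990 false ones, a precision just under 1%. Cutting the false positive rate by a factor of ten improves precision far more than raising the true positive rate from 95% to 100%.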

The practical mitigations are layered:

Threshold tuning per entity class. Do not use one threshold for all hosts. Servers, workstations, and infrastructure appliances have different baseline volatility. A threshold that works for servers will flood you with false positives from workstations.

Multi-signal correlation. Require anomalies in multiple independent signals before alerting. A single anomalous DNS query is noise. An anomalous DNS query from a host that also had an anomalous authentication event and an unusual process execution in the same window is signal.

Contextual suppression. Integrate asset inventory, change management, and maintenance window data. A spike in administrative tool usage during a scheduled patching window is not an anomaly worth investigating.

Progressive alerting. Rather than binary alert/no-alert, score and rank anomalies. Surface the top N to analysts each shift. Let the less-confident anomalies accumulate in a hunting queue that analysts work during dedicated hunting time, not during alert triage.
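Multi-signal correlation and progressive alerting combine naturally: group anomalies by host and time window, keep only groups that span several independent signals, then rank what remains. The tuple layout, the one-hour window, and summing scores are all assumptions made for this sketch.

```python
from collections import defaultdict


def correlate(anomalies, window_seconds=3600, min_signals=2):
    """Group anomalies by host and time bucket; keep groups that span enough
    distinct signal types. Each anomaly is (host, epoch_seconds, signal, score)."""
    groups = defaultdict(list)
    for host, ts, signal, score in anomalies:
        groups[(host, int(ts // window_seconds))].append((signal, score))

    candidates = []
    for (host, bucket), hits in groups.items():
        signals = {signal for signal, _ in hits}
        if len(signals) >= min_signals:
            combined = sum(score for _, score in hits)
            candidates.append((host, bucket, sorted(signals), combined))
    return candidates


def top_n(candidates, n=50):
    """Rank correlated candidates by combined score, highest first."""
    return sorted(candidates, key=lambda c: c[3], reverse=True)[:n]


# Illustrative anomalies: only ws-01 shows several signals in one window.
anomalies = [
    ("ws-01", 1000, "dns", 0.9),
    ("ws-01", 1500, "auth", 0.8),
    ("ws-01", 2000, "process", 0.7),
    ("ws-02", 1200, "dns", 0.95),
    ("srv-01", 1000, "dns", 0.6),
    ("srv-01", 9000, "auth", 0.9),
]
queue = top_n(correlate(anomalies), n=50)
```

Fixed buckets are simple but can split two related events that straddle a boundary; a sliding window avoids that at the cost of more bookkeeping. Single-signal anomalies dropped here are candidates for the hunting queue rather than the trash.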

Adversarial ML

Attackers are not passive targets of your detection models. Once ML-based detection becomes prevalent, adversaries adapt. Understanding how ML models can be evaded is necessary for building durable detection systems.

Evasion Attacks

Evasion attacks modify malicious inputs to avoid detection while preserving malicious functionality. For a malware classifier, this means modifying the binary in ways that change the feature vector (appending benign code sections, altering headers, repacking) without changing what the malware does. For a network-based detector, this means altering traffic patterns (randomizing beacon intervals, padding packet sizes, mimicking legitimate protocol behavior) to stay within the model's learned normal distribution.

The defense against evasion is feature diversity. A model that relies on a single feature is easy to evade: change that feature. A model that uses 50 features across multiple data sources is harder to evade because the attacker must simultaneously control all of them. Defense in depth applies to ML just as it does to traditional security architecture.

Poisoning Attacks

Poisoning attacks target the training data rather than the inference input. If an attacker can introduce malicious activity during the baseline period -- when the model is learning what "normal" looks like -- that activity gets incorporated into the baseline and will not be detected later.

This is not theoretical. An attacker with persistent access who observes that a new ML system is being deployed can deliberately increase their activity during the training window, establishing their malicious behavior as part of the baseline. When the model goes live, it has already learned to tolerate the attacker's patterns.

Mitigations include curating training data with known-clean periods, using robust statistical methods that down-weight outliers during training, and cross-validating baselines against threat intelligence to ensure known-bad patterns are not present in the training set.

Concept Drift

Even without adversarial action, models degrade over time as the environment changes. New software deployments, organizational changes, infrastructure migrations, and seasonal business patterns all shift the data distribution away from the training distribution. This is concept drift.

The detection of drift is itself a monitoring task. Track the distribution of model scores over time. If the mean anomaly score is increasing -- meaning more and more normal activity is being flagged as anomalous -- the model has drifted. Track the false positive rate over time. If it is climbing, either the model or the environment has changed, and retraining is needed.

A practical retraining cadence depends on the volatility of your environment. Monthly retraining is a reasonable starting point for most organizations. High-change environments may need weekly retraining. Low-change environments can stretch to quarterly. But every model needs a scheduled retraining cadence, monitored drift metrics, and a defined process for when drift is detected between scheduled retraining.
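Both drift signals described above reduce to comparing a recent window against a reference window. The sketch below is one simple way to do that; the function names and the trigger values (a shift of one baseline standard deviation, an alert rate twice the baseline) are assumptions to be tuned, not recommended constants.

```python
from statistics import mean, stdev


def score_drift(baseline_scores, recent_scores, max_shift=1.0):
    """Shift of the recent mean anomaly score, in baseline standard deviations."""
    sigma = stdev(baseline_scores)
    if sigma == 0:
        raise ValueError("baseline scores have zero variance")
    shift = (mean(recent_scores) - mean(baseline_scores)) / sigma
    return shift, abs(shift) > max_shift


def alert_rate_drift(baseline_rate, recent_alerts, recent_events, tolerance=2.0):
    """Flag drift when the recent alert rate exceeds `tolerance` x the baseline."""
    recent_rate = recent_alerts / recent_events
    return recent_rate, recent_rate > tolerance * baseline_rate


# Placeholder anomaly scores from the training period and from last week.
baseline_scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10]
recent_scores = [0.20, 0.22, 0.19, 0.21]
shift, drifted = score_drift(baseline_scores, recent_scores)
rate, rate_drifted = alert_rate_drift(0.01, 350, 10_000)
```

Running a check like this daily gives you the trigger for unscheduled retraining between the regular cadence runs.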

Key term: Concept drift. The phenomenon where the statistical relationship between input features and the target variable changes over time. In security ML, concept drift occurs when the environment evolves (new applications, new users, infrastructure changes) and the model's learned baseline no longer accurately represents normal behavior. Unchecked concept drift turns a useful model into a false positive generator.

What ML Cannot Replace

ML is a tool. It is a powerful tool. It is not a replacement for the things that make threat hunting effective as a discipline.

Analyst intuition. A senior analyst looking at a set of events can integrate context that no model has access to: institutional knowledge about ongoing projects, relationships between teams, the political dynamics that make a particular insider threat plausible. That integration of soft context with hard data is not something current ML systems do.

Novel threat recognition. An unsupervised model can tell you something is different. It cannot tell you why it is different or whether the difference represents a threat. The judgment call -- is this deviation malicious, misconfigured, or irrelevant? -- requires a human who understands both the technology and the threat landscape.

Hypothesis-driven hunting. ML-based detection is reactive: it processes data and surfaces anomalies. Hypothesis-driven hunting is proactive: an analyst formulates a theory ("I think an attacker could abuse our Jenkins pipeline to move from CI to production") and designs queries to test it. That creative, adversarial thinking is the highest-value activity in threat hunting, and it is entirely human.

Attribution and intent. Even a perfect detection model tells you what happened, not who did it or why. Attribution requires intelligence analysis -- linking technical indicators to threat actors, understanding geopolitical context, assessing operational patterns across campaigns. Intent requires understanding motive, which is a human judgment informed by context that no feature vector captures.

The right mental model is ML as a force multiplier for analysts, not a replacement. ML handles the volume problem: no human can review 10 million events per day. ML surfaces the small fraction that warrant human attention. The human provides judgment, context, and creative thinking that the model cannot.

A Minimum Viable Detection Pipeline

If you are starting from zero, here is a concrete path to a working ML detection pipeline. This is not the optimal pipeline. It is the simplest pipeline that produces real value and can be iterated on.

Data: Start with DNS

DNS logs are the easiest telemetry source to collect at scale. Every organization has them. They require minimal normalization. And DNS-based threats -- DGA domains, DNS tunneling, suspicious resolutions -- are common enough that your model will find real things.

Collect DNS query logs from your recursive resolvers. If you run your own, enable query logging. If you use a cloud DNS service, enable logging there. You need: timestamp, source IP, queried domain, query type, response code. That is five fields.

Features: Keep It Simple

For your first iteration, extract four features per domain queried:

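Character entropy -- Shannon entropy of the domain string. Randomly generated DGA domains and tunneling domains typically have higher entropy than human-chosen domains. DGAs that assemble names from dictionary words are the exception and will not stand out on this feature alone.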
Domain length -- total character count of the FQDN. Long domains correlate with encoded payloads and tunneling.
Subdomain depth -- number of labels in the domain. Deep subdomain trees are common in tunneling and uncommon in legitimate traffic.
Consonant ratio -- fraction of characters that are consonants. DGA domains tend to have unusual letter distributions compared to English-language domains.

Four features. All computable with basic string operations. No enrichment required.
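The four features above can be sketched with the standard library alone. A few details are assumptions, because the text does not pin them down: entropy is computed over the whole lowercased name including dots, length excludes a trailing root dot, and the consonant ratio divides ASCII consonants by all non-dot characters. The function names are illustrative.

```python
import math
import string
from collections import Counter
from typing import List

# ASCII consonants only; digits and hyphens count toward the denominator, not the numerator.
CONSONANTS = set(string.ascii_lowercase) - set("aeiou")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def extract_features(domain: str) -> List[float]:
    """Return [entropy, length, subdomain depth, consonant ratio] for one domain."""
    name = domain.lower().rstrip(".")
    labels = [label for label in name.split(".") if label]
    chars = [ch for ch in name if ch != "."]
    consonants = sum(1 for ch in chars if ch in CONSONANTS)
    ratio = consonants / len(chars) if chars else 0.0
    return [shannon_entropy(name), float(len(name)), float(len(labels)), ratio]
```

Stacking the output of `extract_features` for every observed domain gives a feature matrix with one row per domain and four columns.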

Model: Isolation Forest

For the first iteration, skip neural networks entirely. Use an Isolation Forest from scikit-learn. It is an unsupervised anomaly detection algorithm that works well on tabular data with a small number of features. It requires no GPU, trains in seconds, and produces interpretable anomaly scores.

from sklearn.ensemble import IsolationForest
import numpy as np

# X_train: feature matrix from baseline DNS data
model = IsolationForest(
    n_estimators=200,
    contamination=0.01,  # expected anomaly fraction
    random_state=42
)
model.fit(X_train)

# decision_function returns a continuous score: lower = more anomalous
scores = model.decision_function(X_new)
# predict returns hard labels: -1 = anomaly, 1 = normal
predictions = model.predict(X_new)

Train on a week of DNS data from your environment. Score new queries in real time or in hourly batches. Flag the lowest-scoring domains for analyst review.

Alert and Review

Generate a daily digest of the top 50 most anomalous domains queried in your environment. Include the source IP, the domain, the feature values, and the anomaly score. Have an analyst review the digest. Record the disposition: true positive (actual threat), false positive (legitimate but unusual), or inconclusive.
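One way to assemble that digest is sketched below. It assumes each scored query is a dict with at least `domain`, `source_ip`, and `score` keys, where lower scores are more anomalous (as with the Isolation Forest `decision_function`). The column names, the disposition labels, and the choice to keep one row per domain are illustrative assumptions.

```python
import csv
import io
from typing import Dict, Iterable, List

# Allowed analyst dispositions, matching the three outcomes named in the text.
DISPOSITIONS = ("true_positive", "false_positive", "inconclusive")
FIELDS = ["source_ip", "domain", "entropy", "length", "depth",
          "consonant_ratio", "score", "disposition"]


def build_digest(rows: Iterable[Dict], top_n: int = 50) -> List[Dict]:
    """Keep the lowest-scoring row per domain, then return the top_n most anomalous."""
    best: Dict[str, Dict] = {}
    for row in rows:
        current = best.get(row["domain"])
        if current is None or row["score"] < current["score"]:
            best[row["domain"]] = row
    return sorted(best.values(), key=lambda r: r["score"])[:top_n]


def digest_to_csv(digest: List[Dict]) -> str:
    """Render the digest as CSV with an empty disposition column for the analyst."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in digest:
        writer.writerow({**row, "disposition": ""})
    return buf.getvalue()


def record_disposition(row: Dict, disposition: str) -> Dict:
    """Return a copy of the row with a validated analyst disposition attached."""
    if disposition not in DISPOSITIONS:
        raise ValueError(f"unknown disposition: {disposition}")
    return {**row, "disposition": disposition}
```

Deduplicating by domain keeps a single noisy host from filling all 50 slots with the same name.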

After four weeks of this process, you will have: a working pipeline, a labeled dataset of analyst dispositions, a clear picture of your false positive patterns, and enough experience to decide whether to invest in a more sophisticated system.

Iterate

From this starting point, the natural evolution is: add more features (domain age enrichment from WHOIS, first-seen tracking, query frequency aggregates), switch to a supervised model once you have enough labeled data, expand to additional data sources (process execution, authentication), and implement automated threshold tuning based on analyst feedback.

Each step is incremental. Each step produces measurable improvement. And each step is grounded in operational reality rather than theoretical capability.

Fig. 04 -- Minimum viable pipeline: DNS anomaly detection

The minimum viable ML detection pipeline. DNS logs, four features, an Isolation Forest, and a daily digest for analyst review. This is deployable in a day using open-source tools and produces real findings within the first week.

Remember: Start with the simplest pipeline that produces value. A daily digest of 50 anomalous DNS domains, reviewed by a human, will find real threats in most environments. Build from there. The teams that succeed with security ML are the ones that ship something small, learn from it, and iterate -- not the ones that spend six months building a comprehensive platform before producing their first alert.

Model Monitoring and Retraining

A deployed model is not a finished product. It is a living system that requires monitoring, maintenance, and periodic retraining. The two failure modes are gradual degradation (concept drift) and sudden failure (data pipeline changes, schema changes, infrastructure failures).

Monitor model score distributions. Plot the distribution of anomaly scores weekly. If the distribution is shifting -- the mean moving, the variance changing, a second mode appearing -- the incoming data has drifted away from what the model was trained on.
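Alongside the plot, a single number can make week-over-week comparison easier. The sketch below uses the population stability index (PSI), which is a swapped-in technique and not something the pipeline above prescribes. It bins a baseline week of scores and measures how far the current week's bin fractions have moved. The bin count and the empty-bin floor are assumptions, and any alerting threshold should be calibrated against your own history.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population stability index of `current` scores relative to `baseline` scores.

    Bins are equal-width over the baseline range; out-of-range values fall
    into the first or last bin. A result of 0.0 means identical bin fractions.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base_frac = fractions(baseline)
    cur_frac = fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_frac, cur_frac))
```

Tracking this value weekly gives a trend line that is easy to review next to the score histograms.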

Monitor false positive rate. Track the percentage of alerts dispositioned as false positive over time. A rising FP rate is the most operationally visible signal that retraining is needed.
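A minimal sketch of that tracking, assuming dispositions are stored as (review date, label) pairs and grouped by ISO week. The label strings and the weekly grouping are assumptions. The denominator here is every reviewed alert, including inconclusive ones.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Week = Tuple[int, int]  # (ISO year, ISO week number)


def weekly_fp_rate(dispositions: Iterable[Tuple[date, str]]) -> Dict[Week, float]:
    """Fraction of reviewed alerts labeled false_positive, per ISO week."""
    totals: Dict[Week, int] = defaultdict(int)
    false_positives: Dict[Week, int] = defaultdict(int)
    for day, label in dispositions:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        totals[week] += 1
        if label == "false_positive":
            false_positives[week] += 1
    return {week: false_positives[week] / totals[week] for week in sorted(totals)}
```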

Monitor data pipeline health. A model that receives no input produces no output but also no errors. Monitor ingestion volume, feature completeness, and inference throughput. A silent pipeline failure is worse than a noisy model.
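The three checks named above can be sketched as one function that returns a list of problems, so an empty list means healthy. The parameter names and the default thresholds (half of expected volume, 99% feature completeness) are assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Sequence


def health_issues(
    ingested: int,
    expected: int,
    scored: int,
    feature_rows: Sequence[Sequence[Optional[float]]],
    min_volume_ratio: float = 0.5,
    min_completeness: float = 0.99,
) -> List[str]:
    """Return human-readable problems found in one batch; empty means healthy."""
    issues: List[str] = []
    # Ingestion volume: far fewer events than usual suggests an upstream failure.
    if ingested < expected * min_volume_ratio:
        issues.append(f"ingestion volume low: {ingested} of ~{expected} expected")
    # Inference throughput: every ingested event should have been scored.
    if scored < ingested:
        issues.append(f"inference lagging: {scored} of {ingested} events scored")
    # Feature completeness: count None and NaN cells across all rows.
    cells = [v for row in feature_rows for v in row]
    missing = sum(
        1 for v in cells if v is None or (isinstance(v, float) and math.isnan(v))
    )
    if not cells:
        issues.append("no feature rows produced")
    elif 1 - missing / len(cells) < min_completeness:
        issues.append(f"feature completeness low: {missing} of {len(cells)} values missing")
    return issues
```

Running this on every batch and alerting on a non-empty result turns a silent failure into a noisy one.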

Version everything. Every model version gets a unique identifier, a record of its training data, its hyperparameters, and its validation metrics. When you deploy a new version, keep the previous version available for rollback. When you retrain, compare the new model's validation performance against the previous version before deploying.
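A version record can be as small as the sketch below. It assumes the training data is available as bytes (for example, the serialized feature file) and derives the identifier from a hash of the data plus the hyperparameters. The `ModelVersion` and `make_version` names and the 12-character identifier are illustrative choices.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelVersion:
    version_id: str
    training_data_sha256: str
    hyperparameters: Dict[str, float]
    validation_metrics: Dict[str, float]


def make_version(
    training_data: bytes,
    hyperparameters: Dict[str, float],
    validation_metrics: Dict[str, float],
) -> ModelVersion:
    """Build a record whose ID is determined by the training data and hyperparameters."""
    data_hash = hashlib.sha256(training_data).hexdigest()
    # sort_keys makes the serialization, and therefore the ID, deterministic.
    payload = json.dumps({"data": data_hash, "params": hyperparameters}, sort_keys=True)
    version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return ModelVersion(version_id, data_hash, dict(hyperparameters), dict(validation_metrics))
```

Because the identifier is derived from its inputs, retraining on the same data with the same settings yields the same ID, which makes accidental duplicates easy to spot.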

Retraining cadence depends on your environment's rate of change. A reasonable default: retrain monthly, validate against the previous month's analyst-labeled data, deploy if validation metrics improve or hold steady, investigate if they degrade.
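The deploy-or-investigate rule can be written down as a small gate. This sketch assumes every validation metric is "higher is better" (precision and recall, for instance) and treats a drop within a small tolerance as holding steady. The tolerance value and the function name are assumptions.

```python
from typing import Dict


def retrain_decision(
    previous: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> str:
    """Return "deploy" if no metric degraded beyond tolerance, else "investigate".

    A metric missing from the candidate counts as degraded.
    """
    degraded = [
        name
        for name, old_value in previous.items()
        if candidate.get(name, float("-inf")) < old_value - tolerance
    ]
    return "investigate" if degraded else "deploy"
```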

Closing Thoughts

ML and NLP are not magic. They are engineering. They require the same discipline as any other engineering system: clear requirements, good data, appropriate architecture, rigorous testing, operational monitoring, and continuous maintenance.

The teams that get value from security ML are not the ones with the most sophisticated models. They are the ones with the cleanest data pipelines, the most disciplined feature engineering, the tightest analyst feedback loops, and the most realistic expectations about what the technology can and cannot do.

Start small. Start with data you already have. Start with a model you can explain to your SOC analysts. Ship something that produces real alerts within a week. Then iterate. The detection gap will not close in a single project. But every well-built model, every tuned threshold, every feedback loop that turns an analyst's judgment into better training data -- each one narrows the gap a little further.

That is the work. It is not glamorous. It does not make good vendor slides. But it is the work that actually improves detection outcomes in environments where signatures alone are no longer enough.