Data Confidence

Oct 20, 2021

How many sensors do you have?

One concern about our Insights data that has been brought to my attention several times is whether we have enough sensors installed for the data to be reliable and useful. That got me thinking about data quality, and about the confidence levels that might be assigned to the data itself. Compared to the incalculable number of processes running on all the computers around the world, we are sampling only a very small portion, monitoring their behaviors, and then extrapolating those behaviors out into a statistical model. We're like a researcher who surveys only a portion of the population and then extrapolates to the whole using tried and true statistics: given the size of the sample, the researcher can state, at a chosen confidence level, that most people would respond similarly, plus or minus a margin of error. (That range around the estimate is the confidence interval.)
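
For readers who like to see the math, here's a minimal sketch of that idea using the standard normal-approximation (Wald) interval for a proportion; the survey counts below are made up purely for illustration:

```python
import math

def proportion_confidence_interval(successes, n, z=2.576):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 2.576 corresponds to a 99% confidence level;
    use z = 1.96 for the more common 95% level.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)  # the "plus or minus" part
    return p, margin

# Hypothetical survey: 412 of 500 respondents answer "yes".
p, margin = proportion_confidence_interval(412, 500, z=1.96)
print(f"{p:.1%} +/- {margin:.1%}")  # ~82.4% +/- 3.3%
```

The same formula applies whether the "respondents" are people answering a survey or processes executing on an endpoint.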

Sampling

We are essentially doing the same thing by sampling how processes behave on only a few hundred endpoints (that number will continue to grow), knowing that there are millions of endpoints out there that we cannot monitor due to practical limitations like the laws of physics. There are, of course, some differences between a researcher surveying humans and our sensors monitoring process behavior. One advantage we have is that processes won't lie, carry bias, or be influenced by the way a question is asked. Even with only a few hundred endpoints, we often see processes running millions of times over months or years, so our confidence that a given process will continue to behave as it has in the past reaches a high level very quickly, just by the sheer volume of execution. That said, because there are hundreds of millions of computers out there running processes and we are sampling only a very small percentage of them, we may miss out on the full breadth, or put another way, the variety of ways processes can behave and interact with one another. However, that doesn't diminish the quality of the data we do have, only the breadth, as I'll explain shortly. While a given executable, identified concretely by its hash, will tend to perform the same way each time, the way the user interacts with that program, the way a particular computer is set up, whether an internet connection is available, and so on will create a wider range of possible behaviors.

Tails of great length

What we tend to find when we look at this breadth or variety of behaviors is that it sometimes ends up creating a long tail. For instance, we have observed conhost.exe being launched by 350+ different parents. Now, one quick point to make first is that the parent of a process really can't be considered a behavior of that process, because the child has no control over who launches it. It's funny how similar that is to life; none of us had any control over the family we were born into, and we can only control how we respond to our circumstances. The same is true for a process: the code that determines how it behaves does not influence how it is launched, only what it does once it's up and running. So, in terms of process behavior, it's better to think about the resultant behaviors we see while a process is running, such as child processes launched, network connections made, registry modifications made, files written, etc. Things like parents, grandparents, the path executed from, and even the filename are not in the control of a given process; they are simply outside influences that may or may not affect how a process behaves.

Now back to the original point about long tails. Once a tail of data reaches a certain length, the individual data points within it become a little less interesting, while the fact that there is simply a lot of variety for that process becomes more interesting. In other words, once we see the list of different parents of conhost.exe grow long enough, we can conclude that any parent we see in our environment, whether it's on the list EchoTrail has captured or not, is not a strong indicator that something is wrong. Variety is fairly normal in that particular parent-child relationship: conhost.exe is used by many different processes, and that is expected behavior. Once we've reached that conclusion about the dataset, adding more variety to the list doesn't change our conclusion; it simply adds breadth.
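
To make "long tail versus short tail" a little more concrete, here's a hypothetical sketch of how one might summarize the variety in a parent list. This is illustrative only, not the logic EchoTrail actually runs, and the observation lists are invented:

```python
from collections import Counter

def variety_summary(parent_observations):
    """Summarize how 'long' the tail of a parent distribution is.

    Returns the number of distinct parents and the share of
    executions claimed by the most common one.
    """
    counts = Counter(parent_observations)
    total = sum(counts.values())
    top_parent, top_count = counts.most_common(1)[0]
    return len(counts), top_parent, top_count / total

# Hypothetical observations: conhost.exe-style variety vs.
# userinit.exe-style predictability.
long_tail = ["cmd.exe", "python.exe", "svchost.exe", "msbuild.exe", "cmd.exe"]
short_tail = ["winlogon.exe"] * 98 + ["other.exe"] * 2

print(variety_summary(long_tail))   # many parents, weak leader
print(variety_summary(short_tail))  # one parent dominates
```

A long list with no dominant entry suggests variety is the norm; a short list with one overwhelming entry suggests anything off-list deserves attention.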

Not so lengthy tails

To contrast long tails, we also have many situations with a very short tail, or really no tail at all, just a little nub. For example, userinit.exe has only been observed launching three different child executables. Userinit.exe typically runs at startup in Windows; we've seen it execute thousands of times over a period of years, and we've only ever seen it launch three different children, one of which is launched in almost 98% of cases. So here we can draw a completely different conclusion: userinit.exe's behavior is extremely predictable, and if we see it launching something other than explorer.exe, that is certainly worth investigating as potentially malicious. Because userinit.exe serves a fairly narrow purpose in the Windows operating system, its lists of parents and grandparents are also short and sweet, with obvious conclusions to be drawn about what's normal and what falls outside of normal, and perhaps even potentially malicious.

In these short-tail scenarios, with one or two dominating data points in each category, it doesn't take as many samples or endpoints to reach a high confidence level, whereas long tails require more samples to achieve similar confidence. Using userinit.exe as an example and applying standard formulas for statistical accuracy, we have observed userinit.exe executing enough times to say, with 99% confidence, that its child process will be explorer.exe 97.88% of the time, plus or minus 0.24%. That's a pretty high confidence level. Contrast that with conhost.exe's parents, where the variation from computer to computer is so great that our confidence in any one parent is fairly low. In fact, the only thing we are confident in with conhost.exe's parents is that there will be variety.
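
As a worked illustration, plugging numbers into the same Wald formula from the earlier sketch reproduces those figures. Note that the counts below are hypothetical, chosen only to match the quoted percentages; the actual sample size isn't published in this post:

```python
import math

# Hypothetical counts chosen so the result matches the figures in the
# text; these are not EchoTrail's actual observation counts.
n = 24000                  # total userinit.exe executions observed
explorer_launches = 23491  # times the child was explorer.exe

z = 2.576                  # z-score for a 99% confidence level
p = explorer_launches / n
margin = z * math.sqrt(p * (1 - p) / n)

print(f"{p:.2%} +/- {margin:.2%}")  # ~97.88% +/- 0.24%
```

Because the dominant proportion is so close to 100%, the term p(1 - p) is tiny, which is exactly why a short tail reaches a narrow margin of error with far fewer samples than a long tail does.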

Closing thoughts

Well, I started off talking about data quality and confidence scores and have strayed a little into some nuanced areas, but my conclusion is this: once we have observed a process and its exhibited behaviors hundreds or thousands of times, on hundreds of endpoints, over months or years, and those behaviors fit a fairly tight and predictable pattern, we can have strong confidence that any behavior observed outside that pattern is probably worth our time to investigate. On the other hand, when we see a huge variety of behaviors from a given process, we can draw a different but equally useful conclusion: variety is normal in that case. Using all of this data as context as we triage alerts from an EDR product, or any security product that provides process information, we can quickly decide whether what we are looking at falls within the range of normal, or is unusual or rare and worth a closer look.

We believe these confidence levels will be particularly helpful for analysts making decisions about alerts, so we will be adding confidence scores to our dataset shortly. Our dataset has many executables that we've seen execute millions of times, where we can ascribe a high confidence level to the behaviors we've observed. Conversely, we also have several extremely rare executables that we've seen launch only a handful of times. In those cases, we will tag the data as low confidence, since we can't extrapolate that behavior out to the whole population with such a small sample size. However, even one sample can be useful for security analysis, provided the confidence level is taken into account as conclusions are drawn.
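
As a rough sketch of what tagging by sample size might look like (the thresholds here are invented for illustration, not our actual scoring rules):

```python
def confidence_tag(sample_count):
    """Map an observation count to a coarse confidence tag.

    Thresholds are illustrative only, not the values EchoTrail uses.
    """
    if sample_count >= 10_000:
        return "high"
    if sample_count >= 100:
        return "medium"
    return "low"

print(confidence_tag(2_500_000))  # high: seen millions of times
print(confidence_tag(4))          # low: only a handful of samples
```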

I hope this helps shed some light on data quality and the confidence instilled by the volume of processes we've observed over time. Feel free to email me at brian@echotrail.io with any questions that arise from reading this.
