Data, in its natural state, is scattered everywhere. Talk to any cybersecurity analyst working in a SOC and you will find that they deal with data spread across multiple tools: in the cloud, on traditional on-prem systems, or in their vendors’ SaaS. This is true whether it’s a 100-employee organization or a much larger one. In fact, large organizations also have to deal with geographically distributed data, edge data, and multiple network tiers.

The common types of cybersecurity-relevant data here are network data, identity-authentication-authorization information, end-point activity, cloud-infrastructure logs, server logs, threat intelligence, vulnerability scan results, etc.

My company’s innovation is federated search across all of the above data in parallel, but I am not here to talk about that. Rather, I want to talk about a common industry problem that we also faced while designing our distributed query engine: how to transform and model cybersecurity data in order to understand and analyze it. Making sense of the data is made harder by its heterogeneity, and it is a problem that analysts, builders, and vendors all face.

Building your own or adopting a vendor’s proprietary model? Use OCSF instead!

Platform teams building their own data lakes face the choice of:

  • (blue pill): just copy the data in the vendors’ formats, knowing full well that it will not be useful for any kind of unified analytics.
  • (red pill): do the work of parsing and normalizing the data into a common structure so that you can gain valuable insights.

A quote relevant to the above choice is “There is no AI without IA” (Rob Thomas, SVP IBM). The foundational work needs to be about building the right Information Architecture, i.e. normalizing, modeling, and correlating the data to make it useful and actionable for the SOC analyst. At Query, our first data schema was built on our own nascent models, but we quickly realized that we needed to work with the industry rather than build our own. Almost by a stroke of luck, the timing aligned and OCSF was born right when we needed it.

Open Cybersecurity Schema Framework (OCSF)

OCSF was announced recently at Black Hat 2022 with cyber industry heavyweights Splunk, AWS, CrowdStrike, IBM, Okta, Palo Alto, Zscaler, and more as its participants. Amazon just announced Security Lake, built on the OCSF schema (more on that in a separate blog I will be writing). Our team joined OCSF’s Slack community to collaborate during the development of the draft schema, and we became one of its earliest adopters.

Cybersecurity data models have existed before; Splunk’s Common Information Model (CIM) is a well-known example. What makes OCSF special and very powerful is its object-modeling schema and its ability to associate objects with the relevant activity or security events. It has standardized object definitions, with relevant attributes, for key objects like User, Device, File, Process, Network Endpoint, Domain Info, and several more. The activity itself is modeled via Event Classes that reference these objects. Similar Event Classes are grouped into Categories.
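To make the object/event-class relationship concrete, here is a minimal Python sketch. The class names and the handful of fields are simplified stand-ins I picked for illustration; they are not the full OCSF definitions, which carry many more attributes.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class User:
    """Simplified stand-in for the OCSF User object."""
    name: str
    domain: Optional[str] = None
    uid: Optional[str] = None


@dataclass
class NetworkEndpoint:
    """Simplified stand-in for the OCSF Network Endpoint object."""
    hostname: Optional[str] = None
    ip: Optional[str] = None
    port: Optional[int] = None


@dataclass
class AuthenticationEvent:
    """An event class that references shared objects instead of flat fields."""
    activity_id: int
    category_uid: int
    time: int
    user: User
    dst_endpoint: NetworkEndpoint
    src_endpoint: Optional[NetworkEndpoint] = None


event = AuthenticationEvent(
    activity_id=1,
    category_uid=3,
    time=1652443095000,
    user=User(name="SYSTEM", domain="NT AUTHORITY"),
    dst_endpoint=NetworkEndpoint(hostname="win-dc-324.queryai.local"),
)

# asdict() recurses into the nested objects, giving a JSON-ready structure.
event_dict = asdict(event)
```

The point of the sketch is that User and NetworkEndpoint are defined once and reused by any event class that needs them, which is what keeps field names consistent across event types.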

Walkthrough Scenario: Modeling Authentication Data

To understand OCSF better, let’s go through a use case. Suppose you need a way to understand authentication information coming from all your toolsets, which is probably the most common analyst need. How would you like that information to be represented in a common model so that you can understand it irrespective of which tool generated the original data?

This is where the Authentication class from OCSF comes in: an event class that represents any successful or unsuccessful logon/logoff activity. To model it, here are some key attributes we should care about (you can also scroll down to see example data for these attributes and more):

User

The user account for which the authentication event happened.

Activity

The activity_id attribute represents whether it was a:

	
1 Logon
2 Logoff
3 Kerberos Authentication Ticket Requested
4 Kerberos Service Ticket Requested
…
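In code, an integer ID like this is convenient to handle as an enumeration. The sketch below covers only the four values listed above; the enum and function names are my own, and any ID not listed here is simply reported as unlisted rather than guessed.

```python
from enum import IntEnum


class AuthActivity(IntEnum):
    """The activity_id values listed above (the list is abbreviated)."""
    LOGON = 1
    LOGOFF = 2
    KERBEROS_AUTH_TICKET_REQUESTED = 3
    KERBEROS_SERVICE_TICKET_REQUESTED = 4


def describe_activity(activity_id: int) -> str:
    """Return a readable name for an activity_id, tolerating unlisted values."""
    try:
        return AuthActivity(activity_id).name
    except ValueError:
        return f"UNLISTED({activity_id})"
```

For example, describe_activity(1) gives "LOGON", while an ID outside the listed set comes back tagged as unlisted instead of raising an error.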

Logon Type

The logon_type_id tells you the type of logon such as:

	
0 System
2 Interactive
3 Network
…

Endpoint

The dst_endpoint attribute is a Network Endpoint object identifying the endpoint that the authentication event targeted. There is also an optional src_endpoint for when there is a known remote source. Note that Network Endpoint is an Object, i.e. a complex attribute with its own set of defined fields and values. This is an important distinction because it enables nesting of standardized object types, unlike other standards such as CIM and CEF, which are very flat. The flat standards have the disadvantage that one gets lost in a sea of field names; consider, for example, how many different fields can hold an IP address.
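Here is a small sketch of why the nesting helps. Because every endpoint is the same kind of object, one helper can pull IP addresses out of any normalized event without knowing vendor-specific field names. The helper name and the two sample events (including the documentation-range IP addresses) are hypothetical.

```python
from typing import Any, Dict, List


def endpoint_ips(event: Dict[str, Any]) -> List[str]:
    """Collect IP addresses from the standardized endpoint objects of an event."""
    ips = []
    for key in ("src_endpoint", "dst_endpoint"):
        endpoint = event.get(key) or {}  # the endpoint may be absent
        ip = endpoint.get("ip")
        if ip:
            ips.append(ip)
    return ips


# Hypothetical events from two different tools, already normalized.
vpn_event = {
    "src_endpoint": {"ip": "203.0.113.7"},
    "dst_endpoint": {"ip": "10.0.0.5"},
}
windows_event = {
    "dst_endpoint": {"hostname": "win-dc-324.queryai.local"},
}
```

With a flat schema, the same helper would need a growing list of per-vendor field names for "source IP" and "destination IP"; here the lookup path is always the endpoint object followed by ip.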

Authentication Protocol

The auth_protocol_id tells what authentication protocol was used, such as:

	
1 NTLM
2 Kerberos
4 OpenID
5 SAML
6 OAUTH 2.0
10 RADIUS
…
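When mapping a vendor's free-text protocol label to this ID, a lookup table with a fallback is usually enough. The sketch below uses only the IDs listed above. The -1 fallback for unrecognized labels is an assumption taken from the example event later in this post, where "Negotiate" is carried with auth_protocol_id -1; check the schema version you target before relying on it.

```python
from typing import Tuple

# Only the IDs listed above; the list in the text is abbreviated.
AUTH_PROTOCOL_IDS = {
    "ntlm": 1,
    "kerberos": 2,
    "openid": 4,
    "saml": 5,
    "oauth 2.0": 6,
    "radius": 10,
}

# Assumption: fallback ID borrowed from the example event in this post.
OTHER_PROTOCOL_ID = -1


def normalize_auth_protocol(raw_label: str) -> Tuple[int, str]:
    """Map a vendor label to (auth_protocol_id, auth_protocol)."""
    label = raw_label.strip()
    protocol_id = AUTH_PROTOCOL_IDS.get(label.lower(), OTHER_PROTOCOL_ID)
    # Keep the original label so nothing is lost when the ID is the fallback.
    return protocol_id, label
```

Keeping the original label next to the ID means an analyst can still see "Negotiate" even though it has no dedicated entry in the table.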

Time

The time attribute captures the Timestamp of when the authentication happened.
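In the example event later in this post, time is carried as epoch milliseconds, so source timestamps have to be converted. Here is a minimal sketch of that conversion for the Windows-style timestamp format used in that example. The function name is mine, and treating the source timestamp as UTC is an assumption; a real pipeline needs to know the time zone of the log source.

```python
from datetime import datetime, timezone


def windows_time_to_epoch_ms(raw: str, tz: timezone = timezone.utc) -> int:
    """Convert 'MM/DD/YYYY HH:MM:SS AM|PM' to epoch milliseconds.

    Assumption: the raw timestamp is expressed in `tz` (UTC by default).
    """
    parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
    return int(parsed.replace(tzinfo=tz).timestamp() * 1000)


epoch_ms = windows_time_to_epoch_ms("05/13/2022 11:58:15 AM")
```

Under the UTC assumption, this input yields 1652443095000, which matches the time value in the example event.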

Category

The category_uid integer 3 represents Audit Activity. This is for the obvious reason that authentication information is audit data and not an alert or another category of data.

Message

The message attribute captures a friendly textual description from the event source.

Authentication Event Example – Windows EventCode 4624

What would the above authentication event look like for a real-world event? Probably the most relevant example is the well-known Windows Event Code 4624, logged when an account successfully logs on. The raw Windows event looks like this (abbreviated for space and relevance):

	
05/13/2022 11:58:15 AM
LogName=Security
SourceName=Microsoft Windows security auditing.
EventCode=4624
EventType=0
Type=Information
ComputerName=win-dc-324.queryai.local
TaskCategory=Logon
Message=An account was successfully logged on.
…

Subject:
Security ID: NT AUTHORITY\SYSTEM
Account Name: WIN-DC-324$
Account Domain: QUERYAI
Logon ID: 0x5F8

Logon Information:
Logon Type: 5
Restricted Admin Mode: -
Virtual Account: No
Elevated Token: Yes

Impersonation Level: Impersonation

New Logon:
Security ID: NT AUTHORITY\SYSTEM
Account Name: SYSTEM
Account Domain: NT AUTHORITY
Logon ID: 0x5F8
Linked Logon ID: 0x0
Network Account Name: -
Network Account Domain: -
Logon GUID: {00000000-0000-0000-0000-000000000000}

Process Information:
Process ID: 0x253
Process Name: C:\Windows\System32\services.exe

Network Information:
Workstation Name: -
Source Network Address: -
Source Port: -

Detailed Authentication Information:
Logon Process: Advapi  
Authentication Package: Negotiate
Transited Services: -
Package Name (NTLM only): -
Key Length: 0

...

OCSF is the vendor-agnostic, standardized way to represent the above, especially when you want to correlate information across vendors. Here is what it looks like in OCSF (abbreviated for space and relevance):

	
{
    "activity_id": 1,
    "actor": {
        "process": {
            "file": {
                "name": "services.exe",
                "parent_folder": "C:\\Windows\\System32",
                "path": "C:\\Windows\\System32\\services.exe",
                "type_id": 1
                },
            "pid": 595
            },
        "user": {
            "account_type": "Windows Account",
            "account_type_id": 2,
            "domain": "QUERYAI",
            "name": "WIN-DC-324$",
            "session_uid": "0x5F8",
            "uid": "NT AUTHORITY\\SYSTEM"
            }
        },
    "auth_protocol": "Negotiate",
    "auth_protocol_id": -1,
    "category_uid": 3,
    "device": {
        "name": "win-dc-324.queryai.local",
        "os": {
            "name": "Windows",
            "type_id": 100
            },
        "type_id": 0
        },
    "dst_endpoint": {
        "hostname": "win-dc-324.queryai.local"
        },
    "logon_process": {
        "name": "Advapi  ",
        "pid": -1
        },
    "logon_type_id": 5,
    "message": "An account was successfully logged on.",
    "metadata": {
        "original_time": "05/13/2022 11:58:15 AM",
        "product": {
            "feature": {
                "name": "Security"
                },
            "name": "Microsoft Windows",
            "vendor_name": "Microsoft"
            },
        "version": "0.31.2"
        },
    "severity_id": 1,
    "src_endpoint": {
        "ip": "-",
        "name": "-",
        "port": 0
        },
    "status_id": 1,
    "time": 1652443095000,
    "unmapped": {
        "EventCode": "4624",
        "EventType": "0",
        "OpCode": "Info",
        "SourceName": "Microsoft Windows security auditing.",
        "TaskCategory": "Logon"
        },
    "user": {
        "account_type": "Windows Account",
        "account_type_id": 2,
        "domain": "NT AUTHORITY",
        "name": "SYSTEM",
        "session_uid": "0x5F8",
        "session_uuid": "{00000000-0000-0000-0000-000000000000}",
        "uid": "NT AUTHORITY\\SYSTEM"
        }
}
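To give a feel for the transformation itself, here is a rough Python sketch that parses a trimmed-down copy of the raw event and maps a few fields into the OCSF shape shown above. This is not how our engine or any official OCSF tooling does it; the function names are hypothetical, only a handful of attributes are mapped, and the section-based parsing is a simplification that assumes the "Section:" / "Key: Value" layout of the sample.

```python
from typing import Any, Dict

RAW_EVENT = """\
EventCode=4624
ComputerName=win-dc-324.queryai.local
Message=An account was successfully logged on.

Subject:
Account Name: WIN-DC-324$
Account Domain: QUERYAI

Logon Information:
Logon Type: 5

New Logon:
Account Name: SYSTEM
Account Domain: NT AUTHORITY
"""


def parse_windows_event(raw: str) -> Dict[str, Dict[str, str]]:
    """Split a raw event into a header dict plus one dict per 'Section:' block."""
    parsed: Dict[str, Dict[str, str]] = {"header": {}}
    section = "header"
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if section == "header" and "=" in line:
            key, _, value = line.partition("=")
            parsed["header"][key] = value
        elif line.endswith(":"):
            # A bare 'Name:' line starts a new section.
            section = line[:-1]
            parsed[section] = {}
        elif ": " in line:
            key, _, value = line.partition(": ")
            parsed[section][key] = value.strip()
    return parsed


def to_ocsf_authentication(parsed: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Map a few parsed fields to the Authentication shape used in this post."""
    header = parsed["header"]
    subject = parsed.get("Subject", {})
    new_logon = parsed.get("New Logon", {})
    logon_info = parsed.get("Logon Information", {})
    return {
        "activity_id": 1,  # Logon, since EventCode 4624 is a successful logon
        "actor": {
            "user": {
                "domain": subject.get("Account Domain"),
                "name": subject.get("Account Name"),
            }
        },
        "category_uid": 3,
        "dst_endpoint": {"hostname": header.get("ComputerName")},
        "logon_type_id": int(logon_info["Logon Type"]),
        "message": header.get("Message"),
        "unmapped": {"EventCode": header.get("EventCode")},
        "user": {
            "domain": new_logon.get("Account Domain"),
            "name": new_logon.get("Account Name"),
        },
    }


ocsf_event = to_ocsf_authentication(parse_windows_event(RAW_EVENT))
```

Tracking the current section matters because "Account Name" appears under both Subject and New Logon; the first becomes the actor's user and the second becomes the event's user.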

Summary

We looked at the need for a standard cybersecurity data model, whether for analytics, common storage, or other use cases. Then we got an introductory flavor of OCSF, a new community standard. We saw how OCSF leads to a normalized and standardized JSON representation across all vendors, one that is well suited for transmission, storage, indexing, searching, programmatic processing, and more. Finally, to understand OCSF via an example, we reviewed what Windows Event Code 4624 for a logon looks like in its raw format and how it gets transformed into OCSF’s JSON.

For more information on OCSF, go to https://ocsf.io/ and https://schema.ocsf.io/.

Happy reading!