Data Wrangling

Regular expressions and sed

Regular expressions are common and useful enough that it’s worthwhile to take some time to understand how they work. Let’s start by looking at the one we used above: /.*Disconnected from /. Regular expressions are usually (though not always) surrounded by /. Most ASCII characters just carry their normal meaning, but some characters have “special” matching behavior. Exactly which characters do what varies somewhat between different implementations of regular expressions, which is a source of great frustration. Very common patterns are:

  • . means “any single character” except newline

  • * zero or more of the preceding match

  • + one or more of the preceding match

  • [abc] any one character of a, b, and c

  • (RX1|RX2) either something that matches RX1 or RX2

  • ^ the start of the line

  • $ the end of the line

sed’s regular expressions are somewhat weird, and will require you to put a \ before most of these to give them their special meaning. Or you can pass -E.
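
For example, here is a tiny made-up substitution run both ways with GNU sed (treating \+ as “one or more” in basic mode is a GNU extension):

echo 'aaab' | sed 's/a\+/X/'    # basic regex: + must be escaped to mean "one or more" -> Xb
echo 'aaab' | sed -E 's/a+/X/'  # extended regex (-E): + works unescaped               -> Xb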

* and + are, by default, “greedy”: they will match as much text as they can.

In most regular expression engines, you can suffix * or + with a ? to make them non-greedy, but sadly sed doesn’t support that. We could switch to perl’s command-line mode instead, which does support that construct:

perl -pe 's/.*?Disconnected from //'
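
To see the difference, here is a sketch on an invented log line where an attacker picked the username “Disconnected from”:

line='Jan 17 03:13:00 host sshd[1234]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]'
echo "$line" | sed 's/.*Disconnected from //'
# greedy: .* runs up to the *last* occurrence      -> 46.97.239.16 port 55920 [preauth]
echo "$line" | perl -pe 's/.*?Disconnected from //'
# non-greedy: .*? stops at the *first* occurrence  -> invalid user Disconnected from 46.97.239.16 port 55920 [preauth]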

We can use “capture groups”. Any text matched by a regex surrounded by parentheses is stored in a numbered capture group. These are available in the substitution (and in some engines, even in the pattern itself!) as \1, \2, \3, etc. So:

sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
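
If the full sshd regex is hard to follow, a minimal made-up example shows the mechanics: each parenthesized group is numbered left to right, and \1, \2 refer back to whatever they matched:

echo 'hello world' | sed -E 's/(hello) (world)/\2, \1!/'
# -> world, hello!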

sort -n will sort in numeric (instead of lexicographic) order. -k1,1 means “sort by only the first whitespace-separated column”. The ,n part says “sort until the nth field”, where the default is the end of the line. In this particular example, sorting by the whole line wouldn’t matter, but we’re here to learn!
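
A quick sketch with made-up input shows the difference the -n flag makes:

printf '10 bob\n2 eve\n1 mallory\n' | sort -k1,1    # lexicographic: 1, 10, 2
printf '10 bob\n2 eve\n1 mallory\n' | sort -nk1,1   # numeric:       1, 2, 10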

What if we’d like to extract only the usernames as a comma-separated list instead of one per line, perhaps for a config file?

ssh myserver journalctl |
 grep sshd |
 grep "Disconnected from" |
 sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
 sort | uniq -c |
 sort -nk1,1 | tail -n10 |
 awk '{print $2}' | paste -sd,

Let’s start with paste: it lets you combine lines (-s) by a given single-character delimiter (-d; , in this case). But what’s this awk business?
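
Here is paste on its own with made-up input:

printf 'alice\nbob\ncarol\n' | paste -sd,
# -> alice,bob,carol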

awk

awk is a programming language that just happens to be really good at processing text streams. There is a lot to say about awk if you were to learn it properly, but as with many other things here, we’ll just go through the basics.

First, what does {print $2} do? Well, awk programs take the form of an optional pattern plus a block saying what to do if the pattern matches a given line. The default pattern (which we used above) matches all lines. Inside the block, $0 is set to the entire line’s contents, and $1 through $n refer to the 1st through nth fields of that line, when the line is split by the awk field separator (whitespace by default; change it with -F). In this case, we’re saying that, for every line, print the contents of the second field, which happens to be the username!
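
As a quick made-up illustration of fields and -F:

printf 'one two three\nfour five six\n' | awk '{ print $2 }'
# -> two
#    five
echo 'root:x:0:0' | awk -F: '{ print $1 }'
# -> root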

Let’s see if we can do something fancier. Let’s compute the number of single-use usernames that start with c and end with e:

 | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l

There’s a lot to unpack here. First, notice that we now have a pattern (the stuff that goes before {...}). The pattern says that the first field of the line should be equal to 1 (that’s the count from uniq -c), and that the second field should match the given regular expression. And the block just says to print the username. We then count the number of lines in the output with wc -l.
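
Here is the same pattern run on a small made-up sample in the uniq -c format, so you can see it in isolation:

printf '1 cole\n3 carole\n1 bob\n' | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }'
# -> cole   ("carole" fails the count test, "bob" fails the regex)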

However, awk is a programming language, remember?

BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }

BEGIN is a pattern that matches the start of the input (and END matches the end). Now, the per-line block just adds the count from the first field (although it’ll always be 1 in this case), and then we print it out at the end. In fact, we could get rid of grep and sed entirely, because awk can do it all, but we’ll leave that as an exercise to the reader.
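
Again with a small made-up sample, the whole program on one line:

printf '1 cave\n1 cat\n2 code\n' | awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 } END { print rows }'
# -> 1   (only "cave" passes both the count test and the regex)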

Analyzing data

You can do math directly in your shell using bc, a calculator that can read from STDIN! For example, add the numbers on each line together by concatenating them together, delimited by +:

some_command(s) | paste -sd+ | bc -l
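
For instance, with three made-up numbers:

printf '1\n2\n3\n' | paste -sd+ | bc -l
# paste produces "1+2+3", and bc evaluates it -> 6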

Or produce more elaborate expressions:

echo "2*($(data | paste -sd+))" | bc -l

You can get stats in a variety of ways. st is pretty neat, but if you already have R:

ssh myserver journalctl |
 grep sshd |
 grep "Disconnected from" |
 sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
 sort | uniq -c |
 awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'

R is another (weird) programming language that’s great at data analysis and plotting. We won’t go into too much detail, but suffice to say that summary prints summary statistics for a vector, and we created a vector containing the input stream of numbers, so R gives us the statistics we wanted!

If you just want some simple plotting, gnuplot is your friend:

ssh myserver journalctl |
 grep sshd |
 grep "Disconnected from" |
 sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
 sort | uniq -c |
 sort -nk1,1 | tail -n10 |
 gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
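
If you don’t have the sshd data handy, the same gnuplot invocation works on any made-up input in the uniq -c format (it opens a plot window, so you need a graphical display; -p keeps the window around):

printf '5 alice\n3 bob\n1 carol\n' | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
# draws three boxes of heights 5, 3, 1, labeled alice, bob, carol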

Data wrangling to make arguments

Sometimes you want to do data wrangling to find things to install or remove based on some longer list. The data wrangling we’ve talked about so far + xargs can be a powerful combo.

For example, as seen in lecture, I can use the following command to uninstall old nightly builds of Rust from my system by extracting the old build names using data wrangling tools and then passing them via xargs to the uninstaller:

rustup toolchain list | grep nightly | grep -vE "nightly-x86" | sed 's/-x86.*//' | xargs rustup toolchain uninstall
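
A safe way to preview what xargs will run is to stick echo in front of the real command; the toolchain names below are made up:

printf 'nightly-2019-01-01\nnightly-2019-02-01\n' | xargs echo rustup toolchain uninstall
# prints the command instead of running it:
# rustup toolchain uninstall nightly-2019-01-01 nightly-2019-02-01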

Wrangling binary data

So far, we have mostly talked about wrangling textual data, but pipes are just as useful for binary data. For example, we can use ffmpeg to capture an image from our camera, convert it to grayscale, compress it, send it to a remote machine over SSH, decompress it there, make a copy, and then display it.

ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - |
 convert - -colorspace gray - |
 gzip |
 ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
