
# Introduction
If you are just starting your data science journey, you might think you need tools like Python, R, or other software to run statistical analysis on data. However, the command line is already a powerful statistical toolkit.
Command line tools can often process large datasets faster than loading them into memory-heavy applications. They are easy to script and automate. Furthermore, these tools work on any Unix system without installing anything.
In this article, you will learn how to perform essential statistical operations directly from your terminal using only built-in Unix tools.
🔗 Here is the Bash script on GitHub. Coding along is highly recommended to understand the concepts fully.
To follow this tutorial, you will need:
- A Unix-like environment (Linux, macOS, or Windows with WSL)
- Standard Unix tools such as cat, cut, sort, uniq, wc, and awk, which come preinstalled on these systems
Open your terminal to begin.
# Setting Up Sample Data
Before we can analyze data, we need a dataset. Create a simple CSV file representing daily website traffic by running the following command in your terminal:
cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF
This creates a new file called traffic.csv with headers and ten rows of sample data.
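If you want a quick sanity check that the file was written correctly, column can pretty-print it as an aligned table. Note that column ships with util-linux on most Linux distributions and with macOS, but nothing else in this tutorial depends on it:
column -s',' -t traffic.csv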
# Exploring Your Data
// Counting Rows in Your Dataset
One of the first things to identify in a dataset is the number of records it contains. The wc (word count) command with the -l flag counts the number of lines in a file:
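wc -l traffic.csv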
The output displays: 11 traffic.csv (11 lines total, minus 1 header = 10 data rows).
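If you want the count of data rows directly, strip the header before counting:
tail -n +2 traffic.csv | wc -l
This prints 10, the number of data rows.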
// Viewing Your Data
Before moving on to calculations, it is helpful to verify the data structure. The head command displays the first few lines of a file:
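head -n 5 traffic.csv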
This shows the first 5 lines, allowing you to preview the data.
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
// Extracting a Single Column
To work with specific columns in a CSV file, use the cut command with a delimiter and field number. The following command extracts the visitors column:
cut -d',' -f2 traffic.csv | tail -n +2
This extracts field 2 (visitors column) using cut, and tail -n +2 skips the header row.
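The same pattern extends to multiple fields; for example, you can pull the visitors and page_views columns together:
cut -d',' -f2,3 traffic.csv | tail -n +2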
# Calculating Measures of Central Tendency
// Finding the Mean (Average)
The mean is the sum of all values divided by the number of values. We can calculate this by extracting the target column, then using awk to accumulate values:
cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'
The awk command accumulates the sum and count as it processes each line, then divides them in the END block.
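Since you will reuse this pattern for other columns, it can be handy to wrap it in a small shell function. The name mean and its two arguments below are just an illustrative sketch, not part of the original pipeline:
# mean FILE COLUMN: average of a numeric CSV column, skipping the header
mean() { cut -d',' -f"$2" "$1" | tail -n +2 | awk '{sum+=$1; count++} END {print sum/count}'; }

mean traffic.csv 2   # visitors
mean traffic.csv 3   # page_views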
Next, we calculate the median and the mode.
// Finding the Median
The median is the middle value when the dataset is sorted. For an even number of values, it is the average of the two middle values. First, sort the data, then find the middle:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'
This sorts the data numerically with sort -n, stores values in an array, then finds the middle value (or the average of the two middle values if the count is even).
// Finding the Mode
The mode is the most frequently occurring value. We find this by sorting, counting duplicates, and identifying which value appears most often:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'
This sorts values, counts duplicates with uniq -c, sorts by frequency in reverse order, and selects the top result. Note that in this sample every visitor count is unique, so all values tie with a frequency of 1; the mode is more informative on columns with genuine repeats.
# Calculating Measures of Dispersion (or Spread)
// Finding the Maximum Value
To find the largest value in your dataset, we compare each value and track the maximum:
awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv
This skips the header with NR>1, compares each value to the current max, and updates it when a larger value is found. Because awk treats the uninitialized max as 0, this works as long as the values are positive; for data that might be entirely negative, initialize max from the first data row, as shown for the minimum below.
// Finding the Minimum Value
Similarly, to find the smallest value, initialize a minimum from the first data row and update it when smaller values are found:
awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv
Running these two commands returns a maximum of 1680 and a minimum of 980 visitors.
// Finding Both Min and Max
Rather than running two separate commands, we can find both the minimum and maximum in a single pass:
awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv
This single-pass approach initializes both variables from the first row, then updates each independently.
// Calculating (Population) Standard Deviation
Standard deviation measures how spread out values are from the mean. For a complete population, use this formula:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv
This accumulates the sum and the sum of squares, then applies the formula \( \sqrt{\frac{\sum x^2}{N} - \mu^2} \), which works out to roughly 207.4 for this dataset.
// Calculating Sample Standard Deviation
When working with a sample rather than a complete population, apply Bessel's correction (dividing by \( n-1 \)) to get an unbiased estimate of the variance:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv
The sample estimate (about 218.6 here) comes out slightly larger than the population value because the sum of squared deviations is divided by \( n-1 \) instead of \( n \).
// Calculating Variance
Variance is the square of the standard deviation. It is another measure of spread useful in many statistical calculations:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv
This calculation mirrors the standard deviation but omits the square root.
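To see the relationship directly, you can print the variance and its square root side by side in a single pass; this is just a small variation on the two commands above:
awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var, "Std Dev:", sqrt(var)}' traffic.csv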
# Calculating Percentiles
// Calculating Quartiles
Quartiles divide sorted data into four equal parts. They are especially useful for understanding data distribution:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1_pos = (count+1)/4
  q2_pos = (count+1)/2
  q3_pos = 3*(count+1)/4
  print "Q1 (25th percentile):", arr[int(q1_pos)]
  print "Q2 (Median):", (count%2==1 ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2)
  print "Q3 (75th percentile):", arr[int(q3_pos)]
}'
This script stores sorted values in an array, calculates quartile positions using the \( (n+1)/4 \) formula, and extracts values at those positions. The code outputs:
Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520
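A common companion statistic is the interquartile range (IQR = Q3 - Q1), which measures the spread of the middle half of the data. Here is a compact variant using the same position formula:
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {print "IQR:", arr[int(3*(count+1)/4)] - arr[int((count+1)/4)]}'
For the visitors column this gives 1520 - 1100 = 420.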
// Calculating Any Percentile
You can calculate any percentile by adjusting the position calculation. The following flexible approach uses linear interpolation:
PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p="$PERCENTILE" '
{arr[NR]=$1; count=NR}
END {
  pos = (count+1) * p/100
  idx = int(pos)
  frac = pos - idx
  if (idx >= count) print p "th percentile:", arr[count]
  else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'
This calculates the position as \( (n+1) \times (percentile/100) \), then uses linear interpolation between array indices for fractional positions.
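Because the percentile is passed in through a shell variable, you can wrap the pipeline in a small helper and sweep several percentiles at once. The function name pct below is just an illustrative convenience built from the same pipeline:
# pct P: P-th percentile of the visitors column, with linear interpolation
pct() {
  cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p="$1" '
    {arr[NR]=$1; count=NR}
    END {
      pos = (count+1) * p/100
      idx = int(pos)
      frac = pos - idx
      if (idx >= count) print p "th percentile:", arr[count]
      else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
    }'
}

for p in 10 25 50 75 90; do pct "$p"; done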
# Working with Multiple Columns
Often, you will want to calculate statistics across multiple columns at once. Here is how to compute averages for visitors, page views, and bounce rate simultaneously:
awk -F',' '
NR>1 {
  v_sum += $2
  pv_sum += $3
  br_sum += $4
  count++
}
END {
  print "Average visitors:", v_sum/count
  print "Average page views:", pv_sum/count
  print "Average bounce rate:", br_sum/count
}' traffic.csv
This maintains separate accumulators for each column and shares the same count across all three, giving the following output:
Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06
// Calculating Correlation
Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation):
awk -F', *' '
NR>1 {
  x[NR-1] = $2
  y[NR-1] = $3
  sum_x += $2
  sum_y += $3
  count++
}
END {
  if (count < 2) exit
  mean_x = sum_x / count
  mean_y = sum_y / count
  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y
    cov += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }
  sd_x = sqrt(var_x / count)
  sd_y = sqrt(var_y / count)
  correlation = (cov / count) / (sd_x * sd_y)
  print "Correlation:", correlation
}' traffic.csv
This calculates Pearson correlation by dividing covariance by the product of the standard deviations.
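If you want to correlate a different pair of columns (say, visitors against bounce_rate), you can parameterize the field numbers. The corr helper below is an illustrative sketch using the same math, simplified to cov / sqrt(var_x * var_y), which is algebraically identical:
# corr FILE COL_A COL_B: Pearson correlation between two CSV columns
corr() {
  awk -F',' -v a="$2" -v b="$3" '
    NR>1 {n++; x[n]=$a; y[n]=$b; sum_x+=$a; sum_y+=$b}
    END {
      if (n < 2) exit
      mean_x = sum_x/n; mean_y = sum_y/n
      for (i = 1; i <= n; i++) {
        dx = x[i] - mean_x; dy = y[i] - mean_y
        cov += dx*dy; var_x += dx*dx; var_y += dy*dy
      }
      print "Correlation:", cov / sqrt(var_x * var_y)
    }' "$1"
}

corr traffic.csv 2 3   # visitors vs page_views
corr traffic.csv 2 4   # visitors vs bounce_rate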
# Conclusion
The command line is a powerful tool for statistical analysis. You can process volumes of data, calculate complex statistics, and automate reports — all without installing anything beyond what is already on your system.
These skills complement your Python and R knowledge rather than replacing them. Use command-line tools for quick exploration and data validation, then move to specialized tools for complex modeling and visualization when needed.
The best part is that these tools are available on virtually every system you will use in your data science career. Open your terminal and start exploring your data.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.