Tuesday, December 10, 2013

Determination of the files used during Linux boot

1. Introduction

Hi all, tonight I would like to present a nice approach to a problem I faced during my immersion into the Linux booting process. It may seem strange and pointless, but you start to look at issues differently when you deal with government customers. In my case, quite strict requirements were imposed on virtual machine file system integrity. This called for detailed observation of any kind of file system activity (create/read/write/exec) from the start of init until the login screen.

 I have already asked about this on Stack Exchange, and the forum experts suggested the following solutions:
  1. Checking the atime file attribute;
  2. An early start of auditd (early indeed: from the initramfs);
  3. Using the systemtap tool or some kernel-level debuggers. 
Let's look at them in detail (note that the following solutions are a little bit RHEL-specific). 

2. Access time
The file system can tell us when a file was last accessed (to make this work, explicitly add the atime option to the line for the desired volume in /etc/fstab). Then we run a script that collects the files whose access timestamps fall between the reboot and the appearance of the login window. 
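For illustration, a minimal sketch of such an /etc/fstab line (the device name here is a placeholder; on kernels where relatime is the default, strictatime is the option that forces an atime update on every read):
/dev/mapper/vg_example-lv_root  /  ext4  defaults,strictatime  1 1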
This simple JSON-formatted configuration file access_time.cfg stores the upper and lower limits for the timestamp lookup and the target directories to look through:
{
 "boot_start": "2013-11-26 15:24",
 "boot_finish": "2013-11-26 15:26",
 "targets":
 [
  "/boot",
  "/bin",
  "/sbin",
  "/etc",
  "/lib", 
  "/lib64",
  "/usr",
  "/root"
 ]
}
Next, let's consider this Python script, which takes the given config into account and processes the access-time timestamps:
#!/usr/bin/python

import operator
import subprocess
import json
import os
from datetime import datetime

# This ls invocation lists files sorted by atime (recursively for every subdir)
cmd = ["ls", "-Rltu", "--time=atime", "--time-style=long-iso"]

# We don't care about the integrity of pictures :)
excluded_extensions = [".png", ".svg"]

# Global vars
boot_start = ""
boot_finish = ""

# Read the JSON config
def load_configs():
    global cmd, boot_start, boot_finish
    with open("access_time.cfg", "r") as f:
        configs = json.load(f)
    # Extend the initial command with the target dirs
    cmd.extend(configs["targets"])
    boot_start = configs["boot_start"]
    boot_finish = configs["boot_finish"]

# Run the external ls command and parse its output
def define_accessed_files():
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    print("Looking for files in target dirs...")
    out = p.stdout.read().split("\n")
    # Clean the output of empty lines
    out_clean = filter(lambda x: len(x) != 0, out)
    # Create a dictionary with directory names as keys and listing lines as values
    out_dict = {}
    key = None
    for line in out_clean:
        if line[0] == "/":
            key = line[:-1]          # strip the trailing ':' of the "/dir:" header
            out_dict[key] = []
        elif line.startswith("total"):
            continue
        else:
            out_dict[key].append(line)
    return out_dict

# Parse timestamps from the stored output and select files with suitable values
def transform_accessed_files(out_dict):
    start = datetime.strptime(boot_start, "%Y-%m-%d %H:%M")
    finish = datetime.strptime(boot_finish, "%Y-%m-%d %H:%M")
    unsorted = {}
    i = 0
    # The loop below may seem too complicated, but it's just timestamp parsing
    for root, files in out_dict.iteritems():
        for entry in files:
            fields = entry.split()
            path = root + "/" + fields[7]
            if os.path.isfile(path):
                _, extension = os.path.splitext(path)
                if extension not in excluded_extensions:
                    unsorted[path] = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M")
                    i += 1
                    if i % 1000 == 0:
                        print("{0}".format(i))
    # Now get rid of the outliers
    truncated = filter(lambda x: start < x[1] < finish,
                       sorted(unsorted.iteritems(), key=operator.itemgetter(1)))
    # Save the results to a JSON file
    print("Files matched: {0}".format(len(truncated)))
    with open("files_used_during_boot", "w") as f:
        json.dump(sorted(map(lambda x: x[0], truncated)), f)

def main():
    load_configs()
    raw_files = define_accessed_files()
    transform_accessed_files(raw_files)

if __name__ == "__main__":
    main()
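Assuming the script is saved as, say, collect_atime.py next to access_time.cfg (the name and directory are hypothetical), it is run right after the login screen appears:
    cd /root/atime && python collect_atime.py
The matched paths end up in the files_used_during_boot file as a sorted JSON list.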

Unfortunately, this solution did not put its best foot forward: it was too imprecise and gave significantly different results across repeated experiments. The problem was mostly caused by my inability to configure rc.local properly. I wanted this Python script to run automatically at the end of boot (in that case we would not even need "boot_finish" in the config), but I did not manage to do it, so shame on me.
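For reference, a minimal sketch of the rc.local entry I was aiming at (the script path and name are hypothetical):
#!/bin/sh
# /etc/rc.d/rc.local is executed at the very end of each multi-user runlevel,
# so launching the collector from here would make "boot_finish" unnecessary
cd /root/atime && /usr/bin/python collect_atime.py &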

3. Auditd
Auditd is one of the trump cards of the Linux security model. This service collects information about the running system at the kernel level, observing every system call. No doubt it can be configured to trace read/write/create/execute file system events. But I'm not sure that auditd and init could start simultaneously :) And it is no wonder that auditd is placed closer to the end of /etc/rc3.d or /etc/rc5.d, so it simply cannot track files accessed before it starts itself.
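For reference, the kind of tracing rules I mean look like this (the watched paths and the key name are just examples):
# Watch directories for read/write/execute/attribute-change events
auditctl -w /etc  -p rwxa -k bootfiles
auditctl -w /sbin -p rwxa -k bootfiles
# Query the recorded events later by the same key
ausearch -k bootfiles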

A possible solution is to reconfigure the initramfs so that auditd starts even before the init script, but the tools convenient for such manipulations seem to be Debian-only.

4. Systemtap
Systemtap is well known among programmers and security admins; I guess it has become a basic tool in every software research / penetration testing laboratory. Moreover, it is a very flexible tool: it can be used both when we are interested in tracing syscalls (file system events in particular) and when we need to know which functions were called in the text segment of our process address space during execution.

But I could not even imagine that systemtap is able to start before the init script and trace the booting process. I really appreciate the help of Red Hat Support, and personally Pushpendra Chavan, with this perfect tool (unfortunately I don't know exactly which developers this method belongs to, otherwise I'd refer to them in the first place). 
So we need to create two simple scripts:
bootinit.sh
#!/bin/sh

# Use tmpfs to collect data
/bin/echo "Mounting tmpfs to /tmp/stap/data"
/bin/mount -n -t tmpfs -o size=40M none /tmp/stap/data

# Start systemtap daemon & probe (staprun options go before the module path;
# the log is written into the tmpfs so it can be copied out after login)
/bin/echo "Loading bootprobe2.ko in the background. PID is:"
/usr/bin/staprun \
   -o /tmp/stap/data/bootprobe2.log -D \
   /root/bootprobe2.ko

# Give the daemon time to start collecting...
/bin/echo "Sleeping a bit.."
sleep 5

# Hand off to the real init
/bin/echo "Starting."
exec /sbin/init 3
  
and bootprobe2.1.stp, written in the systemtap scripting language:
global ident

function get_usertime:long() {
  return task_utime() + @cast(task_current(), "task_struct", "kernel<linux/sched.h>")->signal->utime;
}

function get_systime:long() {
  return task_stime() + @cast(task_current(), "task_struct", "kernel<linux/sched.h>")->signal->stime;
}

function timestamp() {
  return sprintf("%d %s", gettimeofday_s(), ident[pid()])
}

function proc() {
  return sprintf("%d (%s)", pid(), execname())
}

function push(pid, ppid) {
  ident[ppid] = indent(1)
  ident[pid] = sprintf("%s", ident[ppid])
}

function pop(pid) {
  delete ident[pid]
}

probe syscall.fork.return {
  ret = $return
  printf("%s %s forks %d \n", timestamp(), proc(), ret)
  push(ret, pid())
}

probe syscall.execve {
  printf("%s %s execs %s \n", timestamp(), proc(), filename)
}

probe syscall.open {
  if ($flags & 1) {
    printf("%s %s writes %s \n", timestamp(), proc(), filename)
  } else {
    printf("%s %s reads %s \n", timestamp(), proc(), filename)
  }
}

probe syscall.exit {
  printf("%s %s exits with user %d sys %d \n", timestamp(), proc(), get_usertime(), get_systime())
  pop(pid())
}

In order to receive the list of files accessed during the booting process in the systemtap log format, we should do the following:
  1. Download and install the PROPERLY named versions of the systemtap and kernel debuginfo packages (I have been given this link, but you'd better use this one if you're on CentOS);
  2. Create /tmp/stap and /tmp/stap/data:
    mkdir -p /tmp/stap/data 
  3. Place bootprobe2.1.stp and bootinit.sh into /root and make them executable:
    chmod +x /root/boot*
  4. Edit bootinit.sh and change 'exec /sbin/init 3' to 'exec /sbin/init 5' if 5 is your default runlevel.
  5. Create the .ko module from bootprobe2.1.stp:
    cd /root
    stap bootprobe2.1.stp -m bootprobe2 -p4 
  6. Reboot.
  7. Interrupt grub (press Esc or Shift) and press 'a' on the default kernel. At the end of the kernel line enter the following and press Enter:
    init=/root/bootinit.sh
  8. Normal boot will resume. After logging in, kill the stapio process, copy bootprobe2.log out of the tmpfs /tmp/stap/data directory and unmount it:
    killall stapio
    cp /tmp/stap/data/bootprobe2.log /tmp/stap/
    umount /tmp/stap/data  
  9. Now check the /tmp/stap/bootprobe2.log file for the list of all files read during boot (a one-liner for extracting the unique paths is shown below).
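Since every open is logged with the literal word "reads" or "writes" (see the printf calls in bootprobe2.1.stp), the unique list of files read during boot can be extracted with a one-liner like this (assuming the paths contain no spaces):
    grep ' reads ' /tmp/stap/bootprobe2.log | awk '{print $NF}' | sort -u
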
5. Conclusions
As you can see, systemtap provided everything we needed to solve this problem. I tested this script on a CentOS 6.4 minimal distro with the 2.6.32-358.11.1 kernel, and observed about 1400 unique file read/write/create/exec events.

Friday, December 6, 2013

Google Earth DEM Python parser

This post should have contained the source code of a simple Python application that performs:
  1. Google Earth DEM parsing through the Google Elevation API service;
  2. Local storage of the received elevation geodata in shapefiles.
But once the script was already implemented, I began to doubt whether the terms of the Google Maps license allow me to publish it. An expert said no, so I'm washing my hands of it.

Nevertheless, I got to test a nice but quite buggy tool, pyshp: I used a local shapefile as a cache, appending to it every time the Google Elevation API returned a new portion of data (the daily limit is 2500 requests). If you follow this path, don't forget to make a backup every time :) and also consider this example provided by Joel Lawhead:
import shapefile
# Polygon shapefile we are updating.
# We must include a file extension in
# this case because the file name
# has multiple dots and pyshp would get
# confused otherwise.
file_name = "ep202009.026_5day_pgn.shp"
# Create a shapefile reader
r = shapefile.Reader(file_name)
# Create a shapefile writer
# using the same shape type
# as our reader
w = shapefile.Writer(r.shapeType)
# Copy over the existing dbf fields
w.fields = list(r.fields)
# Copy over the existing dbf records
w.records.extend(r.records())
# Copy over the existing polygons
w._shapes.extend(r.shapes())
# Add a new polygon
w.poly(parts=[[[-104,24],[-104,25],[-103,25],[-103,24],[-104,24]]])
# Add a new dbf record for our polygon making sure we include
# all of the fields in the original file (r.fields)
w.record("STANLEY","TD","091022/1500","27","21","48","ep")
# Overwrite the old shapefile or change the name and make a copy 
w.save(file_name)
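Since a shapefile actually consists of several files (at least .shp, .shx and .dbf), a backup made before every overwrite should copy all of them. A minimal sketch:
import glob
import shutil

# Copy every part of the shapefile (.shp, .shx, .dbf) before it is overwritten
for part in glob.glob("ep202009.026_5day_pgn.*"):
    if not part.endswith(".bak"):
        shutil.copy(part, part + ".bak")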

Wednesday, December 4, 2013

Serializing JSON in UTF-8 in Python 2.6

What do you do if, for some reason, you are stuck on Python 2.6+ and need to save to a file an object containing Cyrillic strings? This has been discussed a lot by people interested in JSON serialization of Hebrew. Surprisingly, even Maxim Dubinin, the founder of GIS-Lab, ran into it, and a working solution is still nowhere to be found. It turns out that the commonly suggested
ensure_ascii=False
is not enough: with this flag, json.dumps returns a unicode object as soon as it meets a non-ASCII character, and writing that to a byte-oriented file fails with UnicodeEncodeError, so the resulting JSON string has to be encoded into UTF-8 explicitly:
import json

# json_obj is any object containing Cyrillic strings; name is the output file name
with open(name, "w") as f:
    text = json.dumps(json_obj, indent=4, ensure_ascii=False)
    f.write(text.encode('utf-8'))
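To check the round trip, the file can be read back with plain json.load, which treats byte input as UTF-8 by default:
with open(name, "r") as f:
    restored = json.load(f)  # the Cyrillic strings come back as unicode objects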