Using awk and sort to sort and align or pad columns

Some of my most used tools are awk and sort (and the usual sed and uniq). But sometimes I would like a nicely aligned printout from awk, sorted on one of the columns after the padding. Today I found a nice combination that does just that. This example prints the username and home directory from the /etc/passwd file:

awk -F: '{printf "%-20s%-20s\n", $1, $6}' /etc/passwd | sort -b -k2
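To try the padding and sorting on known input instead of the live /etc/passwd, here is a minimal sketch. The sample records and the function name align_and_sort are made up for illustration, and it assumes neither field contains whitespace. The -b flag matters because the padding leaves a different number of leading blanks in front of the second column on each line, and without it sort may compare those blanks as part of the key.

```shell
# Hypothetical helper: pad username and home directory to 20 columns each,
# then sort on the second whitespace-separated column (the home directory).
# -b ignores the leading blanks that the padding puts in front of column 2.
align_and_sort() {
	awk -F: '{printf "%-20s%-20s\n", $1, $6}' | LC_ALL=C sort -b -k2
}

# Made-up passwd-style records, not taken from a real system.
sample='root:x:0:0:root:/root:/bin/bash
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash'

printf '%s\n' "$sample" | align_and_sort
```

With this input the lines come out ordered by home directory: alice, then root, then www-data.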

English Amber Ale a success

This is my 4th batch of home brewing and I have to say it's my best so far. I have included its recipe for anyone to view in case they are curious to try and make it. Again, I ended up with less than 5 gallons, but it still came out really good. Perhaps ending up with less than 5 gallons was a good thing, but I will be trying this recipe again.

[download id="2"]

Rsync between two hosts keeping permissions date and owner

Every now and then I turn to rsync to save me from copying files by hand when I am looking to do the job more than once. I love how there are so many options for rsync. I also hate that there are so many options for rsync, because the right combination can make or break my day. I recently wrote a script that I feel is my best one yet. This script is in need of major cleanup, as I bet it can be done with better functions. But it is serving its purpose for now:

#!/bin/bash

# sync directories between servers with parameters

# sites dir
SITESDIR="/var/www/sites"

# rsync identity
RSYNCIDENT="ssh -q -i /home/rsync/.ssh/id_rsa"

# remote host to sync from
REMOTEHOST="remote.hostname.com"

# site choice from first arg
ARGSITE=$1

case $ARGSITE in
	site1)
		message="syncing $ARGSITE"
		cmd="update_site site1.domain.com ${SITESDIR}/site1.domain.com/htdocs"
		;;
	site2)
		message="syncing $ARGSITE"
		cmd="update_site site2.domain.com ${SITESDIR}/site2.domain.com/htdocs"
		;;
	*)
		message="could not match site"
		# no site matched, so there is nothing to run: report it and stop
		echo "${message}" >&2
		exit 1
		;;
esac

function update_site {
	echo "${message}"

	# split command output on newlines only (local, so the rest of the script keeps the default IFS)
	local IFS=$'\n'

	dirlist=()

	# build the array with files in the root directory
	for a in $(ssh -i /home/rsync/.ssh/id_rsa rsync@www.domain.com "find /var/www/sites/$1/htdocs -maxdepth 1")
	do
		dirlist+=("$a")
	done

	# loop through the array of files
	for b in "${dirlist[@]}"
	do
		# if we have a path that ends with "htdocs", pass over it because it's the root folder
		if [[ $b =~ htdocs$ ]]; then
			echo "..-.. skipping htdocs"
		else
			echo "..+.. syncing $b"

			# use rsync with some arguments to copy from remote source to local destination
			#
			# explanation of arguments:
			#
			# -P, show progress for transferred files (not currently passed below)
			# -arzogtp, archive, compress (-z) and preserve owner, group, times and permissions
			#           (-a already implies -r, -o, -g, -t and -p)
			# -s, protect arguments for files with spaces
			# -e, command to be executed for grabbing remote files
			/usr/bin/rsync -arzogtp -s -e "$RSYNCIDENT" "rsync@www.domain.com:$b" "$2"
		fi
	done
}

# execute the command that has been defined by the case statement above
$cmd
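The "skip the root folder" check in the loop can also be avoided at the source. As a rough sketch, assuming GNU or BSD find (both accept -mindepth, although it is not in POSIX), asking for -mindepth 1 -maxdepth 1 lists only the entries directly under the root and leaves the root itself out. The function name list_top_level and the temporary directory are made up for illustration, and this runs locally rather than over ssh.

```shell
# Hypothetical helper: print the entries directly under "$1", one per line,
# without printing "$1" itself (that is what -mindepth 1 leaves out).
list_top_level() {
	find "$1" -mindepth 1 -maxdepth 1 | LC_ALL=C sort
}

# Build a throwaway tree to demonstrate on; nothing here touches real sites.
demo_parent="$(mktemp -d)"
demo_root="$demo_parent/htdocs"
mkdir -p "$demo_root/images/icons" "$demo_root/my uploads"
touch "$demo_root/index.html"

# Read the listing line by line so names containing spaces stay intact.
list_top_level "$demo_root" | while IFS= read -r entry; do
	echo "..+.. syncing $entry"
done
```

Names containing newlines would still break a line-based listing like this one, so treat it as a convenience for well-behaved site directories.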

My thoughts and notes from Couchbase conference today

In General

I found the conference to be helpful for a newbie to Couchbase (formed by the merger of Membase and CouchOne, the company behind CouchDB). Of the courses I attended, I found the following information interesting and took down some notes:

Performance

  • keeping views on a separate file system from everything else improves i/o
  • disk bandwidth occasionally drops
  • even identical systems may perform differently
  • use a membase-aware smart client:
      • to reduce network hops
      • or run moxi on the client host
  • couchdb:
      • caching, etag, if-none-match
      • compression
      • keep-alive to avoid opening new connections
      • btw, couchbase single server has some killer performance improvements that are coming soon to apache couchdb
  • memcached api:
      • binary protocol is more efficient than the ascii protocol
      • multi-get and multi-set
      • incr, decr, append, prepend = less traffic
      • use TTL expiration and get-and-touch: set an aggressive TTL and let the data go away on its own instead of accumulating
  • couchdb api:
      • head vs get, ?limit=1: use head when that is all you need; when a query could return something long, use ?limit=1
      • don’t over-rely on skip, use startkey instead
      • use built-in reduce functions: _sum, _count, _stats; write views in Erlang
      • keep view index sizes in mind
      • use ?group_level to aggregate over structured keys (very fast)
      • emit null, and use ?include_docs; less disk i/o, faster view generation
      • or emit more data, so ?include_docs isn’t needed (avoids random i/o on query)
  • document size:
      • fewer items -> less caching overhead
      • reduces the number of requests the clients make
      • promotes server-side processing with _show functions
      • more context available for flexible maps
      • if you have more in your document, you have more meat to work with; breaking it all up, like fetching something and then appending to it, is too expensive (huh?)
  • key size for modeling:
      • use short keys. all keys are kept in ram, tracked for replicas, etc.
      • 255 bytes max length, but prefer short keys
      • at the couchdb layer, the id is likewise used in many places, and short ids are more efficient
  • other index types:
      • full text integration
      • geospatial (can be used for non-spatial data too)
      • hadoop connector w/ couchbase server via TAP
  • non-obvious models in key/value space:
      • e.g. a level of indirection to remove a bunch of keys:
          • define a master key: e.g. ‘obj_rev’: 3
          • define subordinate attribute keys with the master value in the key name, e.g. “obj_foo-3”
          • increment ‘obj_rev’, and rely on TTL to reap stale attribute items
          • no deletes of my own; rely on the cache to delete the data. there are different ways to set up the data
  • diagnostics:
      • ops/sec: is it dropping?
      • ram usage vs high/low water marks: is it getting too close to the high?
      • ram ejections: once usage reaches a certain point, keep an eye on it
      • cache miss ratio: how efficient is the cache?
      • disk write queue size
      • disk space available: more of an operational thing
  • error conditions:
      • disk write errors in logs
      • uptime reset to a low value
      • out of memory conditions (*oom*): shows up in the membase stats from the cluster itself
      • swap usage
  • can you write views directly in erlang? yes, there is a flag and there are examples for it. just like map and reduce, you get to define how you want to work with it. it works inside the couchdb interface. it's turned off by default
  • what do you mean by group level on structured keys?
      • when you emit stuff, the key can be an array with lots of items in it. e.g. you could emit a date as one string and sort by that string, but if you split it up into year, month, day, time, you can then use group_level to aggregate at any of those levels (group level being awesome)
  • why would i put the data store on a disk array?
      • it can be a lot faster to restore the data from a crash, instead of relying… it puts less strain. striping gives you faster performance if you choose large stripe sizes, which forces couch to allocate more and reduces fragmentation.
  • how do you separate your data from the view index?
      • in your config there are parameters for it. by default they are in the same directory
  • emitting null? better than emitting 1?
      • about one bit more efficient. it's minor, don't worry about it
  • compare and contrast performance of couch with a normal sql setup? trade-offs?
      • it depends on how my data is normalized. if it's normalized, to get my data out i need to do a join across multiple machines, which is hard. if it's all in a document, no joins are needed. the underlying philosophy is that all your data is in one place and easy to copy.
  • comment: updating views without downtime when changing a view:
      • if you create a new design doc with a different name, wait for its views to build, then put the same doc right over the same indexes, you don't need to wait for the indexes to rebuild. the code is just replaced. (new name, then rename it. because it has the same hash, it replaces it.) for single doc updates, an update handler?
      • i like the update handler. it's still doing a round trip to the server, and there is no benchmark for its performance.
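To make the group level notes above a bit more concrete: when a view emits an array key such as [year, month, day], asking for group_level=N aggregates the rows that share the first N key elements. The sketch below only imitates that idea locally with awk. It does not talk to CouchDB, and the sample rows, the space-separated layout and the function name group_sum are all made up for illustration.

```shell
# Hypothetical helper: sum the last column of each row, grouping rows that
# share their first "$1" columns (a local stand-in for ?group_level=N with _sum).
group_sum() {
	awk -v level="$1" '
		{
			# Build the group key from the first "level" columns.
			key = $1
			for (i = 2; i <= level; i++) key = key " " $i
			# Add the value (last column) to the running total of that group.
			total[key] += $NF
		}
		END { for (k in total) print k, total[k] }
	' | LC_ALL=C sort
}

# Made-up rows: year month day value, as if a view had emitted [y, m, d] keys.
rows='2011 07 28 5
2011 07 29 3
2011 08 01 4
2012 01 15 2'

echo "group level 1 (per year):"
printf '%s\n' "$rows" | group_sum 1
echo "group level 2 (per month):"
printf '%s\n' "$rows" | group_sum 2
```

The same rows answer both questions without re-emitting anything, which is the appeal of splitting the date into separate key elements instead of emitting it as one string.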