Anonymizing Logs in NGINX and Apache

Here are some tips on ways to anonymize your website logs. Note that these are just tips, not an exhaustive overview.

Plan

Here we will set up the software to

  • stop software from logging the last octet of the IP address
  • remove last octet of IP address in access logs (200, 30x, 404)
  • log full ip address on hack / access denied attempts.

Nginx

In the http section of your nginx.conf, we will add or modify a few blocks.

First, we need to get NGINX to set the anonymized ip address:

    map $remote_addr $remote_addr_anon {
        ~(?P<ip>\d+\.\d+\.\d+)\.    $ip.0;
        ~(?P<ip>[^:]+:[^:]+):       $ip::;
        127.0.0.1                   $remote_addr;
        ::1                         $remote_addr;
    # ip addresses of your server, to not anonymize
    #    w.x.y.z                     $remote_addr;
    #    a:b:c:d::e:f                $remote_addr;
        default                     0.0.0.0;
    }

If you are using NGINX as a reverse proxy, you’ll want to anonymize $http_x_forwarded_for too:

    map $http_x_forwarded_for $http_x_forwarded_for_anon {
        ~(?P<ip>\d+\.\d+\.\d+)\.    $ip.0;
        ~(?P<ip>[^:]+:[^:]+):       $ip::;
        127.0.0.1                   $http_x_forwarded_for;
        ::1                         $http_x_forwarded_for;
    # ip addresses of your server, to not anonymize
    #    w.x.y.z                     $http_x_forwarded_for;
    #    a:b:c:d::e:f                $http_x_forwarded_for;
        default                     -;
    }

In both of these sections, you can optionally add the w.x.y.z (ipv4) and a:b:c:d::e:f (ipv6) ip addresses of your server so it won’t be anonymized. You can also add any other ip addresses that you don’t want anonymized.
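
To see what these rules produce before touching a live server, here is a rough shell sketch that imitates the $remote_addr map. It is not NGINX itself: anon_ip is a hypothetical helper name, it assumes it is given a single plain address, and it anchors the patterns at the start of the value, which the NGINX regexes do not.

```shell
#!/bin/sh
# anon_ip ADDRESS: imitate the map rules (hypothetical helper, not part of NGINX)
anon_ip() {
    # loopback addresses are passed through unchanged, as in the map
    case "$1" in
        127.0.0.1|::1) printf '%s\n' "$1"; return 0 ;;
    esac
    # 1st rule: IPv4, keep three octets and append .0
    # 2nd rule: IPv6, keep the first two groups and append ::
    # 3rd rule: anything else falls through to the default 0.0.0.0
    printf '%s\n' "$1" | sed -E \
        -e 's/^([0-9]+\.[0-9]+\.[0-9]+)\..*$/\1.0/' -e t \
        -e 's/^([^:]+:[^:]+):.*$/\1::/' -e t \
        -e 's/.*/0.0.0.0/'
}

anon_ip 203.0.113.57   # prints 203.0.113.0
anon_ip 2001:db8::1    # prints 2001:db8::
```

Note that the IPv6 rule keeps only the first two groups (32 bits), so it is stricter than zeroing the last octet of an IPv4 address.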

Next we need to set the log format to include the anonymized ip address.

    log_format  main  '$remote_addr_anon - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for_anon"';

Now set your log file to use this format:

access_log  /var/log/nginx/path-to-log-access.log  main;

Log full ip on access denied

If you’d like, you can still log the full ip address on specific errors, for example so that fail2ban can monitor your logs. You should still anonymize these entries later, once the firewall has the info (see "Anonymize old logs" below).

First, add this to your http section (note that I’ve commented out 404 errors):

    map $status $record_error {
        #~^[23]  0;
        400      1;
        401      1;
        403      1;
        #404     1;
        405      1;
        406      1;
        410      1;
        default 0;
    }

Feel free to modify/add according to your needs. It accepts regular expressions, as the first commented line shows.

And we will need to have a log format that records the real ip address (in http too):

    log_format  real  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

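Now, when you want to log, you will have two lines: one for the anonymized entries, and one with the real ip address. The if= parameter only checks whether its value is empty or "0" and cannot negate a variable, so first add an inverted flag to your http section:

    map $record_error $record_normal {
        1        0;
        default  1;
    }

Then add the two access_log lines to your server areas as you normally would:

    # anonymized entry
    access_log /var/log/nginx/path-to-log-access.log  main if=$record_normal;
    # real ip address on errors
    access_log /var/log/nginx/path-to-log-access.log  real if=$record_error;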
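
Once both access_log lines are active, it is worth checking that full addresses only show up on the error statuses. The following is a minimal sketch, not a definitive tool: check_anon is a hypothetical name, and it assumes the log formats shown above with a plain three-part request line, so that the status lands in the ninth whitespace-separated field.

```shell
#!/bin/sh
# check_anon FILE...: print lines that carry a full client address although
# their status is not one of the codes mapped to the real ip address.
# Exits non-zero if any such line is found.
check_anon() {
    awk '
        $9 ~ /^(400|401|403|405|406|410)$/ { next }   # real address expected here
        $1 == "127.0.0.1" || $1 == "::1"   { next }   # loopback is never anonymized
        $1 !~ /(\.0|::)$/                  { print; bad = 1 }
        END { exit bad + 0 }
    ' "$@"
}
```

Run it as check_anon /var/log/nginx/path-to-log-access.log. No output means every non-error entry ends in .0 or ::.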

Apache

Apache has many ways to anonymize ip addresses. Here are a few:

mod_ipv6calc

mod_ipv6calc will anonymize all ip addresses in Apache before they are written to the log. Install the ipv6calc-mod_ipv6calc or mod_ipv6calc package, depending on your distro.

It does not appear to work if Apache is used as a reverse proxy (i.e., if you are logging with %a). See the next section for how to do this with a pipe instead.

The config is stored as /etc/httpd/conf.d/ipv6calc.conf or /etc/apache2/conf.d/ipv6calc.conf. First, load the module by uncommenting

LoadModule ipv6calc_module modules/mod_ipv6calc.so

Then make sure it’s on

ipv6calcEnable                          on

Set the level of anonymizing you want. For example, this takes out quite a lot:

ipv6calcOption anonymize-preset        anonymize-careful

Or, if you have GeoIP databases (or others) installed, you can use this:

ipv6calcOption anonymize-preset        keep-type-asn-cc

The anonymized ip address is set to IPV6CALC_CLIENT_IP_ANON, so you’ll want to enable logging with something like this:

LogFormat "%{IPV6CALC_CLIENT_IP_ANON}e %{IPV6CALC_CLIENT_COUNTRYCODE}e %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"  vhost:%v" combined_anon
CustomLog "logs/access_log" combined_anon

Be sure to turn off your main logging, and/or set logging in your VirtualHosts files. Test the config and restart Apache:

httpd -t
systemctl restart httpd

You’ll know it’s been loaded when you see something like this in your error_log file:

...[ipv6calc:notice] ... internal main     library version: 1.0.0  API: 1.0.0  (shared)                                            
...[ipv6calc:notice] ... internal database library version: 1.0.0  API: 1.0.0  (shared)                                            
...[ipv6calc:notice] ... configured module actions: anonymize=ON countrycode=ON asn=ON registry=ON        
...[ipv6calc:notice] ... default module debug level: 0x00000000 (0)                                                                
...[ipv6calc:notice] ... module cache: ON (default)  limit=20 (default/minimum)  statistics_interval=0 (default)

Anonymize by pipes

You can also do the same by piping the log through ipv6loganon.

LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" vhost:%v" combined
CustomLog "|/usr/bin/ipv6loganon --anonymize-careful -f -a /var/log/httpd/access_log" combined

This is very useful if Apache is being used as a reverse proxy, as mod_ipv6calc doesn’t work in that case.

Log ip only on errors

Another way to make things simple yet keep things private is to only log the ip addresses of actual errors or access denied attempts. The following will keep a log of things accessed, but will only add the ip address for http status codes of 400, 401, 403, 405, 406, and 410.

LogFormat "%400,401,403,405,406,410a %400,401,403,405,406,410l %400,401,403,405,406,410u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" common
# Mark requests from the loop-back interface
SetEnvIf Remote_Addr "127\.0\.0\.1" dontlog
# Mark requests for the robots.txt file
SetEnvIf Request_URI "^/robots\.txt$" dontlog
# Log what remains
CustomLog logs/access_log common env=!dontlog

Anonymize on errors

You can combine mod_ipv6calc with logging the ip only on errors too. That way it won’t show any ip address on normal requests, but will show an anonymized ip address on errors:

LogFormat "%400,401,403,405,406,410{IPV6CALC_CLIENT_IP_ANON}e %400,401,403,405,406,410{IPV6CALC_CLIENT_COUNTRYCODE}e %400,401,403,405,406,410u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"  vhost:%v" combined_anon
CustomLog "logs/access_log" combined_anon

Here’s an example of what it would look like. The first is an access denied entry, the second is a normal request.

93.184.216.0 US username [12/Mar/2018...] "GET /denied-file HTTP/1.0" 403 ...
- - - [12/Mar/2018...] "GET /normal-access HTTP/1.0" 200 ...

Anonymize old logs

If you have old logs that contain full ip addresses, you’ll want to anonymize those too. One simple program is ipv6loganon, part of the ipv6calc package. It reads in a file and anonymizes the ip addresses.

$ cat /var/log/httpd/access_log
207.46.98.53 - - [01/Jan/2007...] "GET /Linux+IPv6-HOWTO/x1112.html HTTP/1.0" 200 ...
2002:52b6:6b01:1:216:17ff:fe01:2345 - - [10/Jan/2007...] "GET /favicon.ico HTTP/1.1" 200 ...

$ cat /var/log/httpd/access_log | ipv6loganon
207.46.98.0 - - [01/Jan/2007...] "GET /Linux+IPv6-HOWTO/x1112.html HTTP/1.0" 200 ...
2002:52b6:6b00:0:216:17ff:fe00:0 - - [10/Jan/2007...] "GET /favicon.ico HTTP/1.1" 200 ...

It has multiple levels of anonymizing, such as --anonymize-careful. See man ipv6loganon for more info.

Script to anonymize multiple files

If you have several log files, and don’t want to have to go through them one by one, you can use the following script to do them all. (It also sets the log’s modified date to what it was previously).

anonymize-logs /path/to/logfile /path/to/logfile2 /path/to/folder/*.log

It’s just a quick script, so make sure you have a backup of your logs before using it. It only runs ipv6loganon on files whose lines start with an ip address, so files such as Apache error logs skip that step, but a regex pass still looks for and replaces ip addresses in every file.

#!/bin/bash
# (c) Matt Bagley, under the GPL2
# given log file(s), it will anonymize the logs and update (only if needed)

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin

if [ "$1" == "" ] || [ "$1" == "-h" ] ; then
        echo "Usage: $0 logfile.log [logfile2.log] [logfile3.log] ..."
	exit 1
fi

td="$(mktemp -d)"
temp_log=$td/file
temp_log2=${temp_log}-pass2
temp_log3=${temp_log}-pass3

clean_up() {
        rm -f $temp_log $temp_log2 $temp_log3
        rmdir $td
        exit
}
trap "clean_up"  1 2 3 4 5 15

for each in "$@" ; do
        #echo Looking at $each
        # does it exist?
	if ! [ -f "$each" ] ; then
	        echo "File not found: $each"
		continue
	fi
	# non-zero?
	if ! [ -s "$each" ] ; then
	        continue
	fi
	# compressed or not? (match a literal .gz suffix)
	compressed=0
	case "$each" in
	        *.gz) compressed=1 ;;
	esac
	# expand log
	if [ $compressed -eq 1 ] ; then
	        zcat "$each" > $temp_log
	else
	        cat "$each" > $temp_log
	fi

        # make sure that none of the lines start with '-'. ipv6loganon does not like this
	# and that no lines have "::: " in them
	cat $temp_log | sed 's/^- /0.0.0.0 /g' | sed 's/:::* /:: /g' > $temp_log2

	# anonymize it (ipv6loganon only does files that have ip address first)
        if [ -n "$(head -n 10 $temp_log2 | awk '{print $1}' | grep -E '(\.|:)')" ] \
        &&  [ -z "$(head -n 10 $temp_log2 | awk '{print $1}' | sed 's/[a-fA-F0-9\.:]*//g')" ] ; then
#	        echo Running ipv6loganon on $each
		cat $temp_log2 | ipv6loganon --anonymize-careful > $temp_log3
		cat $temp_log3 > $temp_log2
		rm -f $temp_log3
	fi
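	# IPv4: zero the last octet. IPv6: keep the first three groups. The IPv6 rule
	# must start at a boundary so timestamps like 12/Mar/2018:10:20:30 are left alone
	cat $temp_log2 | sed -E 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3}/\1.0/g' \
		| sed -E 's#(^|[^0-9a-fA-F:/])([0-9a-fA-F]*:[0-9a-fA-F]*:[0-9a-fA-F]*:)[0-9a-fA-F:]*#\1\2:#g' \
		| sed 's/:::*/::/g' > $temp_log3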
	cat $temp_log3 > $temp_log2
	rm -f $temp_log3

	# verify that it's not empty
        if ! [ -s $temp_log2 ] ; then
                echo "$each was processed as empty"
		continue
	fi
	# diff to see if we changed anything
        if [ -n "$(diff -q $temp_log $temp_log2)" ] ; then
	        # if we did, zip and copy file back
		temp_log_ext=""
		if [ $compressed -eq 1 ] ; then
		        gzip $temp_log2
			temp_log_ext=.gz
		fi
		mv "$each" "${each}-old"
		echo "Replacing $each"
		cat ${temp_log2}${temp_log_ext} > "$each"
		# set the time to the same as the previous file
		touch --reference="${each}-old" "$each"
		# clean up
		rm -f "${each}-old" ${temp_log2}${temp_log_ext}
	fi
	rm -f $temp_log $temp_log2 $temp_log3
done

clean_up

Then add a weekly cron job for it: /etc/cron.weekly/anonymize-logs

#!/bin/bash

/usr/local/bin/anonymize-logs /var/log/httpd/*.gz /var/log/nginx/*.gz /var/log/maillog*.gz

System Logs

Some log files, such as the error logs for Apache/NGINX, store ip addresses but have no built-in way to anonymize them. You can use the above script for these files too.

anonymize-logs /var/log/secure*.gz /var/log/maillog*.gz

It’s probably best to not anonymize /var/log/secure and other current ones if you have a firewall or other program that needs the actual ip addresses in order to function.
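
To get a rough idea of which files still hold full addresses, a quick count can help. This is only a sketch under a few assumptions: count_full_ipv4 is a hypothetical helper, it only looks at IPv4, it treats any address not ending in .0 as not anonymized, and it will also count four-part version numbers that happen to look like addresses.

```shell
#!/bin/sh
# count_full_ipv4 [FILE...]: count IPv4-looking strings whose last octet is not 0.
# Reads standard input when no file is given.
count_full_ipv4() {
    # -o prints each match on its own line, -h drops file name prefixes;
    # grep -c exits 1 when the count is 0, so "|| true" keeps the status clean
    grep -ohE '[0-9]{1,3}(\.[0-9]{1,3}){3}' "$@" | grep -vcE '\.0$' || true
}
```

For rotated logs, decompress first, for example: zcat /var/log/secure*.gz | count_full_ipv4. A result of 0 suggests the file has already been processed.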

Bonus: Limit tracking on analytic software

While working through this, I also reviewed a few ways to limit ip tracking in software that doesn’t run on the server:

Google Analytics

You can also anonymize ip addresses in Google Analytics:

ga('set', 'anonymizeIp', true);
See the Google Analytics documentation on IP anonymization for more info.

Matomo

Matomo recommends enabling anonymization when you first set it up, and I did so on my setup. Go to Administration > Privacy > Anonymize in your dashboard to enable it. See the Matomo privacy documentation for more info.

Conclusion

As you can see, it’s simple and easy not to store the full ip addresses. Go ahead and set this up if you haven’t already.