Analyzing S3 and CloudFront Access Logs with AWS RedShift

It's not really a surprise that we deal with fairly high volumes of user traffic at theScore. We use AWS for pretty much all of our hosting needs, and two of the most heavily used offerings in our environment are its object distribution services: CloudFront and S3. S3 provides file storage for essentially unlimited data, split across as many "buckets" as you would like to create. Separate from that is CloudFront, a content distribution network similar to Akamai or Fastly. CloudFront points at one or more origins and adheres to standard HTTP caching headers. Typically we will deploy S3 buckets or application servers behind CloudFront "distributions", which allows the majority of HTTP requests to our properties to be soaked up by CloudFront edge servers.

On a daily basis this works very well for us. The challenge comes when we need to perform any analysis on these logs. Our S3 and CloudFront logs easily add up to terabytes of data per year, and traditional log parsers tend not to handle that volume. There are providers out there that specialize in parsing these logs, but unfortunately none of them have been able to produce reports on a dataset of this size.

Columnar Databases

So enter RedShift. RedShift is a fairly new offering from AWS: a columnar database built off of PostgreSQL. The scope of this post doesn't cover the details of columnar data stores, but the TL;DR is that instead of arranging data on disk by records (or rows), the data is stored by column. Say you have a classic "users" table with username, e-mail, and password columns. In MySQL the data is stored on disk in exactly that order: first for user ID 1, then 2, then 3 and so on. With a columnar store, all the usernames are stored first, then all of the e-mail addresses, and then all of the passwords.

This gives RedShift and its peers a couple of properties that make them very fast for data analysis. Because each column is grouped together physically, running reports across specific columns of your data is very, very fast. For the contrived "users" example above this isn't very relevant; instead, think about querying invoices in this kind of store. Very often you'd like to run reports grouped by the company responsible for the invoice or by the payment method, along with summary information on the dollar amount of the invoices, and those reports only touch a few columns.
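As a rough illustration, consider a report like the following against a hypothetical invoices table (the table and column names are made up). Only three columns ever need to be read from disk, which is exactly the access pattern a columnar store is optimized for:

-- Only company_id, payment_method and amount are touched; a row store would
-- have to read every full invoice record to answer the same question.
SELECT company_id, payment_method, SUM(amount) AS total_billed
FROM invoices
GROUP BY company_id, payment_method
ORDER BY total_billed DESC;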

There are of course several downsides that would stop you from using RedShift or its brethren for online transaction processing. Namely, pretty much none of them are ACID compliant, which is a big deal if you actually value your data (or at least, it's a big deal when you need ACID compliance). The other primary disadvantage is that since the data is stored in a non-natural form, inserts and updates are much more costly than in a standard database, and updates to the data definition are usually impossible without recreating tables.

With these constraints, though, these databases are still a fantastic tool when paired with a more traditional SQL database. Usually you will feed the data from your OLTP system into your data warehouse on a scheduled basis.
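As a rough sketch of what that feed might look like (the invoices table, staging bucket, and hostname here are all hypothetical), a nightly job could export the day's records from Postgres, stage them in S3, and then COPY them into RedShift:

~$ # export the last day of invoices as CSV and stage the file in S3
~$ psql -h oltp-db.example.internal -d appdb -c \
   "\copy (SELECT * FROM invoices WHERE created_at >= now() - interval '1 day') TO STDOUT CSV" \
   | gzip > /tmp/invoices.csv.gz
~$ aws s3 cp /tmp/invoices.csv.gz s3://example-warehouse-staging/invoices/

Then on the RedShift side:

-- CSV and gzip here match how the file was exported above.
COPY invoices
  FROM 's3://example-warehouse-staging/invoices/invoices.csv.gz'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
  CSV gzip;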

RedShift for log data

Log data is an interesting case for RedShift. As mentioned previously, we have so much log data from our CloudFront and S3 usage that nobody could conceivably work with those datasets using standard text tools such as grep or tail. Many people load their access logs into databases, but we have not found this to be feasible with MySQL or PostgreSQL, because ad-hoc queries against sets with billions of rows can take hours. Once the data is imported into RedShift, the same queries take minutes at the most.

Creating a RedShift instance

Make sure you have the latest version of the AWS CLI tools. If you already have Python and pip, you can install them by simply running pip install awscli.
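If you haven't used the CLI on this machine before, you'll also need to configure your credentials; something like this is all it should take:

~$ pip install awscli
~$ aws configure    # prompts for an access key, secret key and default region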

~$ aws redshift create-cluster \
 --cluster-identifier logdata \
 --cluster-type single-node \
 --node-type dw1.xlarge \
 --master-username luke \
 --master-user-password password \
 --cluster-security-groups default

Wait for the cluster to boot - you can check the status at any time using:

~$ aws redshift describe-clusters --cluster-identifier logdata

This should output quite a bit of useful data about your cluster:

{
  "Clusters": [
    {
      "PubliclyAccessible": true,
      "MasterUsername": "luke",
      "VpcSecurityGroups": [],
      "NumberOfNodes": 1,
      "PendingModifiedValues": {},
      "ClusterVersion": "1.0",
      "AutomatedSnapshotRetentionPeriod": 1,
      "ClusterParameterGroups": [
        {
          "ParameterGroupName": "default.redshift-1.0",
          "ParameterApplyStatus": "in-sync"
        }
      ],
      "DBName": "dev",
      "PreferredMaintenanceWindow": "wed:03:00-wed:03:30",
      "Endpoint": {
        "Port": 5439,
        "Address": "your-endpoint.us-east-1.redshift.amazonaws.com"
      },
      "AllowVersionUpgrade": true,
      "ClusterCreateTime": "2014-05-12T17:13:26.280Z",
      "ClusterSecurityGroups": [
        {
          "Status": "active",
          "ClusterSecurityGroupName": "default"
        }
      ],
      "ClusterIdentifier": "logdata",
      "ClusterNodes": [
        {
          "NodeRole": "SHARED",
          "PrivateIPAddress": "snip",
          "PublicIPAddress": "snip"
        }
      ],
      "AvailabilityZone": "us-east-1d",
      "NodeType": "dw1.xlarge",
      "Encrypted": false,
      "ClusterRevisionNumber": "786",
      "ClusterStatus": "available"
    }
  ]
}

You can see that the cluster status is "available". Also shown is the endpoint that you should use to connect to the RedShift cluster; the standard Postgres command line tools should work just fine:

~$ psql -h your-endpoint.us-east-1.redshift.amazonaws.com -p 5439 -U luke -W

Creating the initial schema

Note that we will be creating separate tables for the S3 and CloudFront logs, simply because the formats differ in some small ways. Common columns share their names so that any queries you end up writing against the data will work on either table.

Run the following creation statements through the PostgreSQL client:

CREATE TABLE s3_logentries (
  creator VARCHAR(MAX) ENCODE RUNLENGTH,
  bucket VARCHAR(255) ENCODE LZO,
  logdate VARCHAR(MAX) SORTKEY ENCODE LZO,
  logtime VARCHAR(MAX) ENCODE LZO,
  cip VARCHAR(50) ENCODE LZO,
  requestor VARCHAR(MAX) ENCODE LZO,
  requestid VARCHAR(MAX) ENCODE LZO,
  operation VARCHAR(MAX) ENCODE LZO,
  requestkey VARCHAR(MAX) ENCODE LZO,
  uri VARCHAR(MAX) DISTKEY ENCODE LZO,
  status VARCHAR(MAX) ENCODE LZO,
  errorcode VARCHAR(MAX) ENCODE LZO,
  bytessent VARCHAR(MAX) ENCODE LZO,
  objectsize VARCHAR(MAX) ENCODE LZO,
  totaltime VARCHAR(MAX) ENCODE LZO,
  turnaroundtime VARCHAR(MAX) ENCODE LZO,
  referer VARCHAR(MAX) ENCODE LZO,
  useragent VARCHAR(MAX) ENCODE LZO,
  versionid VARCHAR(10) ENCODE LZO
);

CREATE TABLE cf_logentries (
  logdate VARCHAR(MAX) SORTKEY,
  logtime VARCHAR(MAX) ENCODE LZO,
  edge VARCHAR(40) ENCODE LZO,
  bytessent INT ENCODE LZO,
  cip VARCHAR(50) ENCODE LZO,
  method VARCHAR(50) ENCODE LZO,
  host VARCHAR(MAX) ENCODE LZO,
  uri VARCHAR(MAX) DISTKEY ENCODE LZO,
  status VARCHAR(20) ENCODE LZO,
  creferrer VARCHAR(MAX) ENCODE LZO,
  useragent VARCHAR(MAX) ENCODE LZO,
  cs_uri_query VARCHAR(MAX) ENCODE LZO,
  cookie VARCHAR(MAX) ENCODE LZO,
  x_edge_result_type VARCHAR(MAX) ENCODE LZO,
  x_edge_request_id VARCHAR(MAX) ENCODE LZO,
  x_host_header VARCHAR(MAX) ENCODE LZO,
  protocol VARCHAR(10) ENCODE LZO,
  cs_bytes INT ENCODE LZO,
  time_taken VARCHAR(MAX) ENCODE LZO
);

There are a few things to call out here. You'll note that in most cases not much thought has been given to the data types: instead of sizing the VARCHARs carefully, most columns are simply relaxed to the MAX size. Typically you'd want to size things appropriately; for these simple examples, however, it has no real effect on the performance of our queries.

Absent are any specifications for index creation and foreign key constraints. RedShift (like most databases of its type) doesn't require or even support index specifications, as they are not used by the data storage backend or the query planner. Foreign key constraints, since they depend on traditional indexes, are also not enforced, although RedShift will accept their creation for informational purposes only.

Additionally, there are some new keywords here. The first is ENCODE. RedShift compresses all data that you import, and each column can have a separate compression definition. In practice, if you leave off the compression clause, RedShift will wait for a couple hundred thousand rows to be inserted and then analyze the columns to select an appropriate compression algorithm. Once you've worked with similar datasets, though, you can capture these compression specifications and re-use them in your table creation.
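If you'd rather capture those choices explicitly, RedShift can report a suggested encoding for each column of a table that already has data in it, and you can bake the suggestions into future table definitions:

-- Samples the loaded data and prints a recommended encoding per column.
ANALYZE COMPRESSION cf_logentries;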

Last are the SORTKEY and DISTKEY options on columns. SORTKEY specifies how the data in a table is ordered on disk and dictates its natural order. If most of your reports are constrained or ordered by a single column, then that column is the best candidate for this key (in our example the date is the obvious choice).
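For instance, a report restricted on the sort key only needs to scan the blocks covering that range rather than the whole table; a quick sketch:

-- Because logdate is the sort key, only the blocks for early April are scanned.
SELECT COUNT(*)
FROM cf_logentries
WHERE logdate BETWEEN '2014-04-01' AND '2014-04-07';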

DISTKEY needs a little more explaining. RedShift is a multi-node system and can have anywhere from 1 to 100 database nodes participating in your cluster. This test instance, and the one we actually use right now at theScore, are both single node. However, once you branch out into a multi-node topology, RedShift will shard your data across the nodes using the chosen DISTKEY. The documentation recommends selecting a distribution key that will be used frequently in JOINs, which is why we pick the URI. We don't actually use multiple nodes here, though, so it's just for future reference.
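If you do outgrow a single node, the cluster can be resized in place from the CLI; something along these lines should work (the cluster goes into read-only mode while the data is redistributed across the nodes):

~$ aws redshift modify-cluster \
 --cluster-identifier logdata \
 --cluster-type multi-node \
 --number-of-nodes 2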

Load some data!

Now that you have the schema up and running, let's actually load some data. CloudFront distributions and S3 buckets can both be configured to log to a specific bucket and key; obviously that has to be set up for these examples to work. Let's assume that we have a distribution in CloudFront with the ID E1DHT7QI9H0ZOB, and that it's logging to a bucket named cloudfront-logs.

RedShift adds a COPY command to the Postgres language that lets you import logs from that bucket directly. It's quite flexible and allows you to specify delimiters, error thresholds, compression schemes and so on. For our simple example, though, we'll just load one month of logs from one of our CloudFront distributions:

COPY cf_logentries
  FROM 's3://cloudfront-logs/E1DHT7QI9H0ZOB.2014-04-'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
  DELIMITER '\t' MAXERROR 200 FILLRECORD IGNOREHEADER 2 gzip;

We're loading standard tab-delimited, gzip-compressed log files with the header lines ignored. Additionally we're using the FILLRECORD parameter, which causes the import process to pad out missing columns at the end of a line with NULL values. This is pretty useful because the CloudFront and S3 log formats change periodically as new columns are added. Finally, MAXERROR tells the command to abort the copy if more than 200 errors are encountered.
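If a load does abort, or rows are being rejected and you want to see why, the stl_load_errors system table records the details:

-- Most recent load errors: which file and line failed, and the reason.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;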

This could take some time - in my example the full load took just over 3 hours:

INFO:  Load into table 'cf_logentries' completed, 137647904 record(s) loaded successfully.
COPY

You can use the same COPY command to load your S3 logs as long as you specify the correct table as the first parameter to the command.
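For example, something along these lines should work, assuming your S3 access logs are delivered to a bucket named s3-access-logs (note that S3 access logs are space-delimited and uncompressed, with quoted fields, so the delimiter and compression options need adjusting to match):

COPY s3_logentries
  FROM 's3://s3-access-logs/2014-04-'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
  DELIMITER ' ' REMOVEQUOTES MAXERROR 200 FILLRECORD;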

Running Queries

That's pretty much it for the setup. Now you can run arbitrary queries against the log tables. Here are some fairly trivial examples that have come in handy:

SELECT uri, count(*) as hits
FROM cf_logentries
GROUP BY uri ORDER BY hits DESC;

Dead simple; this takes all the records and just spits out the unique URIs by popularity:

                           uri                            |   hits
----------------------------------------------------------+----------
 /api/v1/news                                             | 17017442
 /api/v1/news.json                                        |  9923266
 /api/v1/rivers/nhl/news/pinned                           |  6901556
 /api/v1/rivers/nhl/news                                  |  6280842
 /api/v1/rivers/nba/news/pinned                           |  5227507
 /api/v1/rivers/top_news/news                             |  5051595
 /api/v1/rivers/nba/news                                  |  4705110
 /api/v1/rivers/nfl/news/pinned                           |  4565983
 /api/v1//news.json                                       |  4342518
 /api/v1/rivers/nfl/news.json                             |  4206018
 /api/v1/rivers/nfl/news                                  |  3379180
 /api/v1/nhl                                              |  2994995
 /api/v1/rivers/mlb/news/pinned                           |  2992940
 /api/v1/rivers/mlb/news                                  |  2649503
 /api/v1/rivers/top_news/news/pinned                      |  2516112
 /api/v1/rivers/nhl/news.json                             |  2493204
 /api/v1/rivers/nba/news.json                             |  2481048
 /api/v1/rivers/1/news.json                               |  2278832

As simple as it is, the query is looking at 135 million rows and executes in approximately 10 seconds using the base RedShift instance type. You can modify the query a bit and see requests per day like so:

SELECT COUNT(*),
  LEFT(logdate, 12) as day
FROM cf_logentries
GROUP BY day
ORDER BY day;

  count  |    day
---------+------------
 4180128 | 2014-04-01
 4874722 | 2014-04-02
 4402991 | 2014-04-03
 4781033 | 2014-04-04
 4676276 | 2014-04-05
 4781785 | 2014-04-06
 3863464 | 2014-04-07
 4368877 | 2014-04-08
 5306479 | 2014-04-09
 4870124 | 2014-04-10
 4141415 | 2014-04-11
 4896455 | 2014-04-12
 5314497 | 2014-04-13
 4695349 | 2014-04-14
 4315279 | 2014-04-15
 4569367 | 2014-04-16
 4509744 | 2014-04-17
 3657470 | 2014-04-18
 4069335 | 2014-04-19
 4127096 | 2014-04-20
 4445855 | 2014-04-21
 5043469 | 2014-04-22
 4803794 | 2014-04-23
 4538804 | 2014-04-24
 3929871 | 2014-04-25
 5056818 | 2014-04-26
 4710418 | 2014-04-27
 4640340 | 2014-04-28
 5389785 | 2014-04-29
 4686864 | 2014-04-30

With CloudFront you really should care about your cache hit ratio - maybe it's obvious, but the load on your origin systems decreases as your content becomes easier to cache. This query looks at the most requested URLs and gives you a cache hit ratio for each:

WITH cache_hits AS (
  SELECT uri, cs_uri_query, COUNT(*) AS hits
  FROM cf_logentries
  WHERE x_edge_result_type IN ('Hit', 'RefreshHit')
  GROUP BY uri, cs_uri_query
),
top_uris AS (
  SELECT uri, cs_uri_query, COUNT(*) AS total
  FROM cf_logentries
  GROUP BY uri, cs_uri_query
  ORDER BY total DESC
  LIMIT 1000
)
SELECT top_uris.uri AS uri,
  LEFT(top_uris.cs_uri_query, 20) AS qs,
  top_uris.total AS requests,
  cache_hits.hits AS cachehits,
  LEFT(((cache_hits.hits::float / top_uris.total) * 100), 5) AS hitrate
FROM top_uris
JOIN cache_hits ON top_uris.uri = cache_hits.uri
  AND top_uris.cs_uri_query = cache_hits.cs_uri_query
ORDER BY top_uris.total DESC;

About a minute later, I get these results:

                 uri                 |          qs          | requests | cachehits | hitrate
-------------------------------------+----------------------+----------+-----------+---------
 /api/v1/rivers/nhl/news/pinned      | limit=20             |  5301492 |   5136042 | 96.87
 /api/v1/rivers/nba/news/pinned      | limit=20             |  3750216 |   3553899 | 94.76
 /api/v1/rivers/nfl/news/pinned      | limit=20             |  2540830 |   2385556 | 93.88
 /api/v1/rivers/top_news/news/pinned | limit=10             |  2516105 |   2353308 | 93.52
 /api/v1/rivers/mlb/news/pinned      | limit=20             |  2102154 |   1952020 | 92.85
 /api/v1/rivers/nfl/news/pinned      | -                    |  2025153 |   1845123 | 91.11
 /api/v1/rivers/nhl/news             | limit=14             |  2010704 |   1933622 | 96.16
 /api/v1/rivers/nfl/news.json        | limit=10             |  1869261 |   1696612 | 90.76
 /api/v1/rivers/nhl/news             | limit=15             |  1845157 |   1776506 | 96.27
 /api/v1/rivers/nhl/news/pinned      | -                    |  1600064 |   1454696 | 90.91
 /api/v1/rivers/nhl/news.json        | limit=10             |  1515039 |   1379522 | 91.05
 /api/v1/rivers/nba/news             | limit=14             |  1480116 |   1396547 | 94.35
 /api/v1/rivers/nba/news/pinned      | -                    |  1477291 |   1291877 | 87.44
 /api/v1/rivers/nba/news.json        | limit=10             |  1371061 |   1194216 | 87.10
 /api/v1/rivers/top_news/news        | limit=20&embed_ids=t |  1289425 |   1139939 | 88.40
 /api/v1/rivers/1/news/pinned        | -                    |  1167849 |   1000635 | 85.68
 /api/v1/rivers/top_news/news        | limit=4              |  1122927 |   1040506 | 92.66
 /api/v1/rivers/nba/news             | limit=15             |  1114807 |   1051494 | 94.32
 /api/v1/rivers/1/news.json          | limit=10             |  1072153 |    909073 | 84.78
 /api/v1/rivers/mlb/news             | limit=14             |   933096 |    862023 | 92.38
 /api/v1/rivers/mlb/news/pinned      | -                    |   890785 |    755674 | 84.83
 /api/v1/rivers/nfl/news             | limit=16             |   868015 |    807376 | 93.01
 /api/v1/rivers/mlb/news.json        | limit=10             |   820678 |    696375 | 84.85
 /api/v1/rivers/mlb/news             | limit=15             |   805442 |    743153 | 92.26
 /api/v1/rivers/top_news/news        | limit=5              |   739409 |    690218 | 93.34
 /api/v1/rivers/nfl/news             | limit=17             |   623315 |    581964 | 93.36
 /api/v1/rivers/nba/news             | limit=16             |   582077 |    551094 | 94.67
 /api/v1/rivers/nhl/news             | limit=16             |   572628 |    552693 | 96.51
 /api/v1/rivers/nfl/news             | limit=15             |   454028 |    425197 | 93.64
 /api/v1/rivers/nhl/news             | limit=13             |   438636 |    424338 | 96.74
 /api/v1/rivers/ncaab/news/pinned    | limit=20             |   418286 |    333987 | 79.84
 /api/v1/rivers/ncaab/news/pinned    | -                    |   376280 |    287165 | 76.31

These hit rates are decent for most endpoints, but you can definitely see right away how the query parameters could be tuned a bit to provide an even faster experience for the clients consuming these endpoints.

Conclusion

That's pretty much it for now. There's a lot more you can do with this, as the sky is the limit when it comes to the queries you can write against your data. Hopefully this was a beneficial introduction to RedShift with a novel use case. If I got anything wrong let me know!

There are other obvious related things you'd want to do if you use a setup like this (such as automating the data load) that we may cover in a future post.
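In the meantime, here's a rough sketch of what an automated nightly load might look like - the database name, credentials handling, and scheduling are assumptions, and in practice you'd also want to track which log prefixes have already been loaded:

#!/bin/sh
# Load yesterday's CloudFront logs into RedShift (intended to be run from cron).
# Assumes a ~/.pgpass entry for the cluster and GNU date.
YESTERDAY=$(date -d yesterday +%Y-%m-%d)
psql -h your-endpoint.us-east-1.redshift.amazonaws.com -p 5439 -U luke -d dev <<SQL
COPY cf_logentries
  FROM 's3://cloudfront-logs/E1DHT7QI9H0ZOB.${YESTERDAY}'
  CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
  DELIMITER '\t' MAXERROR 200 FILLRECORD IGNOREHEADER 2 gzip;
SQL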