Prerequisites

Before proceeding, make sure you have all of the following installed:

You may also refer to the user guides in the documentation to read about the basic concepts of Longneck.


The "weblog-demo" application

This demo application processes Apache HTTP server logs and stores them in an SQL database. The main goal is to support web analytics with clean and rich data:

  • read common weblog files and write output into an SQL database,
  • identify web analytics events, e.g. page views, in log files containing HTTP request events,
  • filter out log lines irrelevant for web analytics,
  • add descriptive fields with easy-to-understand data, e.g. browser platform derived from user agent strings, or domain names derived from client IP addresses.

The application is pre-compiled, and contains all configuration files required. More details on the application can be found in the examples section.

The demo application can also write CSV output files if you define a <csv-target> instead of a <database-target>; see the "sources and targets" guide. However, the examples detailed here are more useful with a relational DBMS to store and analyze the results. Here we assume PostgreSQL, but the application also works with MySQL: just substitute "-mysql" for the "-pgsql" part in each file name mentioned. The ddl_scripts folder contains scripts to create the required database and tables.


Processing common logs

The first example processes Apache HTTP server log files in common log format.
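For orientation, a line in common log format carries seven fields: client host, identity, user, timestamp, request line, status code and response size. The Python sketch below parses one such line with a regular expression. It only illustrates the input format and is not Longneck's own parser; the names CLF_PATTERN and parse_common_line are made up for this example.

```python
import re
from datetime import datetime

# Apache common log format: %h %l %u %t "%r" %>s %b
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_common_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month names are parsed with the current locale (English assumed here).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # "-" means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_common_line(sample)
print(record["host"], record["status"], record["request"])
```

Lines that do not match the pattern yield None, which a caller could route to an error log instead of stopping the run.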

operation

The application processes the specified logs, standardizes or computes field values, assigns events and writes them into the weblog_event_incoming database table. Using this table, various statistical reports can be created. The log processor uses the notion of events to create metrics. For each log line a number of events are created, each contributing to a certain metric. For example, to get the number of visits to a certain page, one must count the pageview events associated with the URL of the page. The events can be customized or extended by editing the determine.basic-events block in the repository/weblog-events.block.xml block package.
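To make the notion of events more concrete, the Python sketch below assigns events to already parsed records and counts page views per URL. The rules used here (a successful GET of a non-asset URL is a pageview, a 4xx or 5xx status is an error) and the record field names are assumptions chosen for illustration; the actual rules live in the determine.basic-events block and may well differ.

```python
from collections import Counter

# Hypothetical list of static asset suffixes that should not count as page views.
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

def assign_events(record):
    """Return the list of (event_group, event) pairs for one parsed log record."""
    events = []
    if record["status"] >= 400:
        events.append(("web", "error"))
    elif record["method"] == "GET" and not record["url"].lower().endswith(ASSET_SUFFIXES):
        events.append(("web", "pageview"))
    return events

records = [
    {"method": "GET", "url": "/index.html", "status": 200},
    {"method": "GET", "url": "/logo.png", "status": 200},
    {"method": "GET", "url": "/index.html", "status": 200},
    {"method": "GET", "url": "/robots.txt", "status": 404},
]

# The "visits to a page" metric is the number of pageview events per URL.
page_views = Counter(
    r["url"] for r in records if ("web", "pageview") in assign_events(r)
)
print(page_views.most_common())
```

Returning a list keeps the door open for several events per log line, which is how the text above describes the real processor.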

If anything goes wrong, you can review the process by looking at the logs in the log directory. Data errors with detailed trace information are written into the weblog_event_incoming_err database table (we deliberately left erroneous data in our test input files to demonstrate error handling).

data target

In order to store its results, Longneck must have access to an SQL database. The required tables can be created with the database script ddl_scripts/weblog_event_tables_pgsql.sql. Before running the application, you should also review the config/weblog_demo.properties configuration file, and adjust the JDBC settings to suit your setup (be aware, for example, that the script contains a weak default password).

Note: Some PostgreSQL setups are configured in pg_hba.conf to use "peer" authentication for clients connecting from localhost, which is not supported by JDBC. We recommend switching to "md5" authentication, and using a user/password combination to connect to the database. An example database creation script is provided in the ddl_scripts/ folder, which must be executed as a database user with superuser privileges. MySQL uses user/password authentication by default, so it is unaffected by this problem.
For PostgreSQL, you might need to add locale options to CREATE DATABASE, depending on your installation - these options are included in the comments. The create_database.sql scripts should be executed as a superuser, weblog_event_tables.sql as weblog_demo user.

data source

The application is now ready to be run, but you will need data to work on. It's best to use your own log files, but if you don't have any available, you can download sample log files from the downloads section. These are anonymised and filtered log files of SZTAKI websites (the DMS and bigdata business intelligence group sites), with the structure unchanged and with very similar content to the original log files.

running the application

To run the application, type the following command, and replace <LOG_FILE> with the path to the log file:

$ bin/weblog-demo -T -E -p processes/weblog-common-pgsql.xml \
          -DweblogSource.cli.path=<LOG_FILE>

Note: You might want to use the -t <NUMBER> switch to set the number of worker threads (the default is 1). This is especially handy here, since the process involves DNS lookups, which add significant delay to the process. Increasing the number of workers parallelizes DNS queries and reduces overall processing time in most cases.
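The effect of extra workers on lookup-bound processing can be sketched in Python as follows. This is not how Longneck implements its worker threads; it is a minimal illustration, and resolve_all, reverse_dns and slow_fake_resolver are hypothetical names. The demo uses a simulated resolver with a fixed delay so that it runs without network access.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

def reverse_dns(ip):
    """Resolve an IP address to a host name; return None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def resolve_all(ips, workers=1, resolver=reverse_dns):
    """Resolve many addresses concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(resolver, ips))

def slow_fake_resolver(ip):
    """Stand-in for a DNS query: waits 50 ms, then returns a made-up name."""
    time.sleep(0.05)
    return "host-" + ip.replace(".", "-")

ips = ["192.0.2.%d" % i for i in range(1, 9)]
for workers in (1, 4):
    start = time.perf_counter()
    names = resolve_all(ips, workers=workers, resolver=slow_fake_resolver)
    print(workers, "worker(s):", round(time.perf_counter() - start, 2), "s")
```

Because the threads spend most of their time waiting for answers, the elapsed time should drop roughly in proportion to the number of workers until some other resource becomes the bottleneck.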

You can also specify a folder instead of a file to process all log files found in the folder. The application can also read gzipped log files with the ".gz" extension, so there is no need to decompress them.
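The same idea, reading either a single file or every file in a folder and decompressing ".gz" files on the fly, can be sketched in Python. This is only an illustration of the behaviour described above, not Longneck's source code; iter_log_lines is a hypothetical helper, and the demo creates its own temporary files.

```python
import gzip
import tempfile
from pathlib import Path

def iter_log_lines(path):
    """Yield lines from a log file, or from every file in a folder.

    Files ending in ".gz" are decompressed while reading.
    """
    path = Path(path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]
    for file in files:
        opener = gzip.open if file.suffix == ".gz" else open
        with opener(file, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")

# Demo with one plain and one gzipped file in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.log").write_text("one\ntwo\n", encoding="utf-8")
    with gzip.open(Path(folder, "b.log.gz"), "wt", encoding="utf-8") as handle:
        handle.write("three\n")
    lines = list(iter_log_lines(folder))

print(lines)
```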

Running the application with the --help switch returns an explanation of the command line switches.

To run the application with a CSV file output, we can use the pre-defined CSV process file:

$ bin/weblog-demo -p processes/weblog-common-csv.xml -DweblogSource.cli.path=<LOG_FILE>

In this case the -T -E parameters, which instruct Longneck to truncate the result tables, are meaningless and are therefore omitted.

interpreting the results

The processes/weblog-common-pgsql.xml file defines the whole process; see the data transformation guide for the basic concepts, and the XML API / process language reference for all language elements.

Results are written into the weblog_event_incoming table, and data errors arrive in weblog_event_incoming_err. Other errors, e.g. file parsing errors where the whole line is affected, are logged in log/application.log and log/longneck.log. Errors remain local: only the processing steps and data elements affected by an error are omitted. Checking the error table shows, for example, that an erroneous IP address does not affect the processing of other fields or records.
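The principle that errors remain local can be sketched in Python as follows. Each field is handled by its own step, and a failure in one step is recorded without discarding the rest of the record. The step functions, the client_ip and status field names and the shape of the error entries are assumptions made for this example; Longneck's own blocks and error table layout may differ.

```python
import ipaddress

def process_record(record):
    """Apply each field step independently; return (clean_fields, errors)."""
    steps = {
        "client_ip": lambda value: str(ipaddress.IPv4Address(value)),
        "status": int,
        "request_url": str.strip,
    }
    clean, errors = {}, []
    for field, step in steps.items():
        try:
            clean[field] = step(record[field])
        except (ValueError, KeyError) as exc:
            # Only the failing field is dropped; the other fields are still processed.
            errors.append({"field": field, "value": record.get(field), "error": str(exc)})
    return clean, errors

clean, errors = process_record(
    {"client_ip": "999.1.2.3", "status": "200", "request_url": " /index.html "}
)
print(clean)
print([e["field"] for e in errors])
```

Here the invalid IP address ends up in the error list, while the status code and the URL are still cleaned and kept.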

We can easily write useful queries on the result table. The following example generates a top list of pages ranked by page view count.

select request_url, count(*) page_view_ct
from   weblog_event_incoming
where  event_group = 'web' and event = 'pageview'
group by request_url
order by page_view_ct desc;

Changing the event value to 'error' lists the URLs of failed requests instead; on our data, the result indicates that we should create a robots.txt file.
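To try this kind of query without a database server, the Python sketch below runs it against an in-memory SQLite table. The three-column table is a deliberately reduced stand-in for weblog_event_incoming (the real table has more columns), the rows are invented test data, and top_urls is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real result table.
conn.execute(
    "create table weblog_event_incoming (request_url text, event_group text, event text)"
)
conn.executemany(
    "insert into weblog_event_incoming values (?, ?, ?)",
    [
        ("/index.html", "web", "pageview"),
        ("/index.html", "web", "pageview"),
        ("/about.html", "web", "pageview"),
        ("/robots.txt", "web", "error"),
    ],
)

def top_urls(event):
    """Return (request_url, count) rows for one event type, most frequent first."""
    query = """
        select request_url, count(*) as ct
        from   weblog_event_incoming
        where  event_group = 'web' and event = ?
        group by request_url
        order by ct desc, request_url
    """
    return conn.execute(query, (event,)).fetchall()

print(top_urls("pageview"))
print(top_urls("error"))
```

Passing the event name as a parameter makes it easy to switch between the page view list and the error list.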


Processing more detailed logs

While the common log format is widely used, it lacks detailed data on website visitor interaction (e.g. cookies, user agent strings etc.). The web server, however, can be configured to log all relevant attributes of the HTTP requests, and Longneck is also capable of processing such detailed log files. To do that, you must update both the web server and the log processor configuration with a new log format.

web server logging

This step is optional, since the provided log files can be used to try Longneck. However, you can easily configure a web server to produce detailed logs. In the case of Apache, add this line to the main web server configuration:

LogFormat "%v %h %u %l %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Cookie}i\" \"%{Cookie}n\"" detailed

This creates a new log format in the web server configuration with the name "detailed". To actually write logs in this format, you must also add another log directive inside a VirtualHost or main web server configuration:

CustomLog <PATH_TO_LOG_FILE> detailed

To see the results, reload the web server configuration, visit the configured website and do a few clicks to produce data.

data source log format

Longneck contains a weblog file parser, with a default configuration aiming to read common log files. The log file format is specified in the config/weblog-config.xml file, using the same patterns as the Apache HTTP server. Therefore, we can use the Apache sample format above, called "detailed". The simplest way is to copy the configuration file from the templates folder:

$ cp templates/logformat/config-detailed.xml config/weblog-config.xml

This will overwrite the current log format with the "detailed" format.
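To see what a line in the "detailed" format contains, the Python sketch below splits one into its twelve fields, following the order of the LogFormat directive shown earlier. It is an illustration only and not the parser shipped with Longneck; the field names, the sample line and parse_detailed_line are invented for this example.

```python
import re

# A quoted field; Apache escapes embedded quotes as \" inside it.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# %v %h %u %l %t "%r" %s %b "%{Referer}i" "%{User-agent}i" "%{Cookie}i" "%{Cookie}n"
DETAILED_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) (\S+) \[([^\]]+)\] '
    + QUOTED + r' (\d{3}) (\d+|-) '
    + " ".join([QUOTED] * 4)
)

FIELDS = (
    "vhost", "host", "user", "ident", "time", "request",
    "status", "size", "referer", "user_agent", "cookie_header", "cookie_note",
)

def parse_detailed_line(line):
    """Return a dict of raw string fields, or None if the line does not match."""
    match = DETAILED_PATTERN.match(line)
    return dict(zip(FIELDS, match.groups())) if match else None

sample = (
    'www.example.org 192.0.2.10 - - [10/Oct/2000:13:55:36 +0200] '
    '"GET /index.html HTTP/1.1" 200 1043 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)" "session=abc" "-"'
)
rec = parse_detailed_line(sample)
print(rec["vhost"], rec["status"], rec["user_agent"])
```

Compared with the common format, the extra virtual host, referer, user agent and cookie fields are what make cookie tracking and user agent analysis possible.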

running the application

Now you have both logs and a log processor to work on them. To process the newly recorded logs, type the following command, replacing ${YOUR_TEST_LOG_FILE} with the path to your log file:

$ bin/weblog-demo -T -E -p processes/weblog-detailed-pgsql.xml \
          -DweblogSource.cli.path=${YOUR_TEST_LOG_FILE}

The weblog-detailed-pgsql.xml process produces richer output, allows cookie tracking, DNS resolution and various other features.

tips

It's fun to play a bit with Longneck; here are some tips:

  • You may find measuring throughput speed of the application interesting.
    • The DNS extension applies clever caching and parallel processing, but the speed of the DNS servers limits the speed of Longneck. You can, however, turn off reverse DNS resolution to study true throughput by changing the appropriate processing block from <block-ref id="ip-address:ipv4.reverse-dns"> to <block-ref id="ip-address:ipv4.validate">. The new block only validates IPv4 addresses, without trying to resolve them.
    • The -t command line parameter enables multi-threaded execution; with twice as many threads we can reach nearly twice the throughput, until we hit I/O limits.
  • Substituting database targets with CSV targets eliminates the need for SQL databases. You only have to change <database-target> to <csv-target> in the process XML; see the "sources and targets" section in the guide for details.
  • Domain hierarchy XMLs are not used by default; however, running the domain-hierarchy process gives an example of post-processing the result table.
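As a rough illustration of the throughput tip above, the Python sketch below measures how many records per second a validation-only step can handle when no DNS lookups are involved. It says nothing about Longneck's actual performance; is_valid_ipv4 and records_per_second are hypothetical helpers, and the addresses are generated test data.

```python
import ipaddress
import time

def is_valid_ipv4(value):
    """Validate an IPv4 address without trying to resolve it."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

def records_per_second(step, values):
    """Run step on every value and return the measured throughput."""
    start = time.perf_counter()
    for value in values:
        step(value)
    elapsed = time.perf_counter() - start
    return len(values) / elapsed if elapsed > 0 else float("inf")

addresses = ["192.0.2.%d" % (i % 256) for i in range(10000)]
rate = records_per_second(is_valid_ipv4, addresses)
print("validate only: %.0f records/s" % rate)
```

Swapping in a step that performs real lookups would show how much of the total time is spent waiting for DNS servers.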


Creating your own application

The demo application is just one example of how Longneck can be used. Longneck aims to be a versatile tool for processing data available in multiple formats. It can process SQL, CSV and structured log data, and it can be extended to use just about any data source that can produce or consume records. To create your own application, you can either modify the demo, or you can use the App Seed project to start from scratch.

Longneck uses Apache Maven 3. We maintain a public Maven repository for Longneck modules and extensions, so the easiest way to build a Longneck application is to use this repo. See the downloads page for detailed instructions.

The processes for the application are stored in the processes/ folder. The documentation for the process language is available on the XML API page. Longneck supports code sharing through packages of blocks or entities, which are stored in the repository/ folder. To add new blocks, just copy the package files into the folder and they are ready to use. We have published some blocks in the Content Repository section.