Experimental: this page is under development.

Web analytics: log processing

Below we review some parts of the weblog-demo Longneck application. The application can be downloaded from the downloads page. The get started page contains instructions on how to run weblog-demo.

Web analytics deals with visitor interaction on web sites: identifying events, then collecting, cleaning and completing the attributes of those events. One way to get information on visitor interaction is page tagging, where scripts embedded in the web pages send data to an analytics server. Another way is to process web server log files, which contain HTTP request events. Here we extract valuable knowledge from the HTTP request data. The example weblog-demo application aims to

  • parse web server log file fields,
  • transform and verify data, e.g. do URL decoding, construct the full request URL, check the IP address format,
  • identify web analytics events among the HTTP request events, e.g. page views,
  • select the relevant lines,
  • check, clean and extend fields with external data, e.g. do a reverse DNS lookup,
  • write the output data set to a database or to CSV files.

These data preparation steps enable accurate web analytics by providing a rich, high-quality data set of non-technical events at a higher level of abstraction.
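
To make the transformation and event identification steps concrete, here is a minimal Python sketch of three of them: checking the IP address format, building a URL-decoded full request URL, and deciding whether a request counts as a page view. It is not Longneck code. The function names, the list of static file suffixes and the page view rule are assumptions chosen for illustration only.

```python
import ipaddress
from urllib.parse import unquote

# Hypothetical rule: requests for these static assets are not page views.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def is_valid_ip(value):
    """Return True if value is a syntactically valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def full_url(host, path, scheme="http"):
    """Build the full request URL from the host and the URL-decoded path."""
    return f"{scheme}://{host}{unquote(path)}"


def is_page_view(method, path, status):
    """Treat a successful GET of a non-static resource as a page view."""
    path_only = path.split("?", 1)[0].lower()
    return (
        method == "GET"
        and status == 200
        and not path_only.endswith(STATIC_SUFFIXES)
    )
```

A real application would usually refine the page view rule, for example by also excluding known robots.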


configuring the weblog parser

The longneck-weblog extension provides web server log file parsing functionality. Several log file formats are in use. W3C defines the standard Common Log Format, but other formats are also frequently used, such as the combined format of Apache.

The Longneck extension uses Apache access log format strings to define the input data source. This way, the log format settings of an Apache HTTP server can be reused directly to configure the Longneck file data source. The format string is placed in the log-config element of config/weblog-config.xml:


          <log-format>
            <!--
              %v: virtual host
              %h: remote hostname or ip address
              %l: remote logname supplied by identd
              %u: remote user if authenticated
              %t: time
              %r: first line of the HTTP request
              %>s: final status code (not counting 100 CONTINUE)
              %b: size of response in bytes, excluding HTTP headers.

              Source: http://httpd.apache.org/docs/current/mod/mod_log_config.html#formats
            -->
            <log-config>%h %l %u %t \"%r\" %>s %b</log-config>
          </log-format>
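
To illustrate what this format string describes, the following Python sketch parses one log line written in the format `%h %l %u %t "%r" %>s %b`. It is not how the longneck-weblog extension is implemented. The regular expression, the group names and the `parse_line` function are assumptions made for this example, and the pattern does not handle escaped quotes inside the request line.

```python
import re
from datetime import datetime

# One named group per directive in: %h %l %u %t "%r" %>s %b
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # %t is written as [day/month/year:hour:minute:second zone]
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # %b logs "-" when no bytes were sent
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Returning None for lines that do not match lets the caller count or log malformed lines instead of stopping at the first one.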
        

After configuration, the weblog-file-source element can be used as a data source in processes:


           <source>
             <weblog-file-source name="cli"/>
           </source>
        

Here we don't refer to a specific file: the name "cli" means that the file name is given as a command line parameter (as -DweblogSource.cli.path=<LOG_FILE>).


weblog events: filtering and duplicating records


database target


csv target


handling data error



CSV import / export

Importing CSV files into databases and exporting data from databases to CSV files are common data integration tasks. We created a simple Longneck application to solve such problems.

  • Longneck JDBC data sources and targets enable flexible configuration of source and target tables, SQL statements, types and formats, and can reach the speed that the JDBC driver offers.
  • Longneck CSV data sources and targets enable flexible CSV file handling.
  • Data transformation blocks can be defined on demand to correct known limitations of our data sets.
  • Longneck is robust: it notifies us about parsing or other data errors, but tries to process every line and field.

The main goal of the csv-impexp application is to demonstrate the strengths of Longneck data sources and targets, but it can also be used as a standalone application in place of standard database import/export utilities.
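
The general idea of an error-tolerant CSV import and export can be sketched in a few lines of Python using the standard csv and sqlite3 modules. This is not the csv-impexp application and it does not use Longneck. The `visits` table, its `ip` and `hits` columns and both function names are hypothetical. As a simplification, the sketch skips a failing row entirely, while still reporting it and continuing with the remaining lines.

```python
import csv
import io
import sqlite3


def import_csv(conn, csv_file):
    """Load rows into a 'visits' table; return a list of (line number, error) pairs."""
    conn.execute("CREATE TABLE IF NOT EXISTS visits (ip TEXT, hits INTEGER)")
    errors = []
    reader = csv.DictReader(csv_file)
    for row in reader:
        try:
            conn.execute(
                "INSERT INTO visits (ip, hits) VALUES (?, ?)",
                (row["ip"], int(row["hits"])),
            )
        except (KeyError, TypeError, ValueError) as exc:
            # Report the bad line and keep going instead of aborting the import.
            errors.append((reader.line_num, str(exc)))
    conn.commit()
    return errors


def export_csv(conn, csv_file):
    """Write the 'visits' table back out as CSV with a header row."""
    writer = csv.writer(csv_file)
    writer.writerow(["ip", "hits"])
    writer.writerows(conn.execute("SELECT ip, hits FROM visits ORDER BY ip"))


if __name__ == "__main__":
    # Small demonstration with an in-memory database and one invalid row.
    demo_conn = sqlite3.connect(":memory:")
    sample = io.StringIO("ip,hits\n10.0.0.1,3\n10.0.0.2,many\n")
    print(import_csv(demo_conn, sample))
```

Collecting errors together with their line numbers, rather than raising on the first one, mirrors the robustness goal described in the list above.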