Concepts

Longneck is an ETL tool, aiming to solve data transformation problems. It reads a data source (CSV files, database tables, logfiles etc.), performs transformation operations that result in standardized, cleaned data, and sends the results to a data target (again CSV files, databases, etc.).

The real value of a data transformation framework is the set of data transformation rules it contains for a given business problem. Therefore, easy and flexible phrasing of the domain knowledge with reusable data transformation modules is very important, and is the main focus of Longneck. The language we use to describe transformation rules is XML-based, with special elements and rules to describe transformation logic or data constraints.

Data transformation processes read a data source, perform data transformations, check validity, and write to a data target. They are simply called processes, and are stored in XML files.

Processes work on records, which consist of named, character-valued fields. Records are processed on their own, in a data-streaming fashion, without joins or operators that work on the full input data set.

Records describe real-world entities, e.g. email addresses, names or tax numbers. Expected properties of the data elements describing these entities are collected in *.entity.xml files. Entities define a common data format and content for data elements of the given entity, and are used to define the cleaned and standardized output of the processes.

The main building blocks of processes are called blocks, and are stored in *.block.xml files. They describe a data transformation step, for example checking and correcting zip codes or standardizing phone numbers. Blocks can be embedded in other blocks and are extendable.

Blocks and entities may refer to dictionaries and word sets. Dictionaries define additional fields for a given key field, e.g. browser type and platform for user agent strings. Word sets enumerate the valid values of a field for an entity, e.g. valid TLD parts of email addresses.

A set of entities and blocks, together with their word sets and dictionaries, is called a set of transformation rules.

Data are checked during processing against the entities or custom assumptions at any processing step. We define constraints; if data violates these constraints, a data error is raised and the erroneous elements are sent to a special error target. Data errors never halt the process: Longneck is robust and keeps data errors local, so they never affect unrelated data or the process as a whole.

Longneck and its extensions are written in Java, and can be used in all computing environments where a JVM exists. Processes can be executed from the command line using the default commands provided in longneck-core (basic CLI). This way efficient parallel execution is achieved using multiple threads running on different processors or processor cores. Custom applications can also be built easily (using app-seed), or Longneck can be used as a Java library linked to data processing applications.
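For example, a process can be run on several worker threads from the command line; the following is a sketch using the app-seed launcher and the -t (worker threads) switch described in the command-line usage section, where 4 is an illustrative thread count (the default is 1):

$ bin/app-seed -p processes/your-process.xml -t 4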

An experimental extension, longneck-storm, allows the use of distributed environments, scaling applications to arbitrary computer clusters. Thanks to the simple data model and careful implementation, near-linear scalability can be achieved: we can easily estimate the number of computer nodes needed for the desired throughput.



Modules and extensions

The following Maven projects are the main modules of Longneck:
longneck-core
This project is the core of Longneck functionality: it contains the implementation of the base set of processing blocks and constraints, and also provides a command line execution class.
  • contains the definition and implementation of the main processing blocks,
  • supports interpretation of XML blocks,
  • supports linking and using extensions,
  • supports process definitions,
  • provides a fast implementation of processes,
  • and provides CLI class with commands for executing processes.
longneck-storm experimental
Provides an environment to create a jar submittable to a Storm cluster. This module is experimental, and has not been pushed to GitHub yet.
longneck-lookup
Longneck extension enabling dictionary lookups, adding descriptive fields for fields containing dictionary keys. The extension provides its own XML elements both for defining and using dictionaries, and a fast lookup implementation.
longneck-bdb
An extension enabling use of Berkeley DB Java Edition-based lookup dictionaries.
longneck-weblog
An extension for parsing and processing Apache HTTP server log files. See the weblog parser section for more details.
longneck-dns
An extension doing multi-threaded reverse DNS lookups for IP addresses with efficient caching.
longneck-content-repo experimental
This public repository contains transformation rule sets for common problems. When starting a new project, you should check whether it already contains some of the building blocks you need.
longneck-cdv
An extension for check digit verification. This tool is applicable to modulo-based linear combination check digits. See the check digit verification section for more details.
app-seed
The seed app is a skeleton for Longneck applications, containing the structure and elements of a typical Longneck app. To start a new application from scratch, you can use this as a starting point.
weblog-demo
Demo application processing Apache HTTP server log files, filtering and naming conceptual events (e.g. page view) and adding derived fields (e.g. user agent browser type), producing a clean and rich data set for web analytics. For an introduction, see get started.


Longneck applications

Longneck applications follow a convention for the placement of files. This is a recommendation only, and you may deviate from it at your own discretion. However, examples on this site assume you follow these conventions.

A Longneck application has the following folders:

  • bin/: Contains the executables to run the application, both normally and in various modes such as debugging or profiling. Additional utilities may be placed here.
    Note: The scripts provided in the App seed project rely on the pom.xml for much of their configuration.
  • config/: The place of the configuration files. Any file ending with ".properties" is parsed as a configuration file, and its content is loaded into the runtime configuration. Note: The file "log4j.properties" is an exception; it is dedicated to the log configuration.
  • lib/: Project dependencies are placed here. Note that the development executable uses dependencies from the Maven repository directly.
  • log/: The default location for application log files.
  • persistence/: Contains database files which are used by various Longneck modules such as the DNS extension.
  • processes/: Process files are put here.
  • repository/: This is where entity and block packages are kept. Each file is named accordingly, with ".block.xml" for block packages or ".entity.xml" for entity packages. When a process is started, only the referred packages are loaded.
  • templates/: The location for template files, which are used to create the executables and default configuration files. This is mostly done by copying the template files to the correct location and then replacing placeholders with properties from the POM.

Build

Longneck modules and applications are Maven projects, leaving dependency and build management to Maven. Apache Maven 3.0.x or above is needed to compile Longneck apps. A simple $ mvn install then collects dependencies and builds the module or application. It is also worth looking at the various compilation and deployment options and tools provided by Maven.

App seed

The app-seed project provides a starting template for Longneck applications. After an mvn install,

  • all dependencies needed are downloaded and put into /lib,
  • the application is built and placed in /target,
  • executable scripts are prepared and copied into /bin.

The application contains basic Longneck modules and extensions, and is capable of running processes, without any modifications or enhancements:

 bin/app-seed -p processes/your-process.xml 

Here the name of the application remains app-seed, the process we would like to run is your-process.xml, and we assume that all configuration and repository files referred to in the process file are placed in their directories.

No new Java code is needed to initialize a new Longneck application; you can easily

  • update pom.xml,
    • rename Maven artifact, adjust version, update project information,
    • optionally remove unnecessary dependencies (e.g. reverse DNS extension),
    • modify build process if necessary, etc.
  • rename and modify template binary scripts,
  • modify config files. app-seed has only one config file, for logging; you can set the log level in the log4j.properties config file, for example (a minimal sketch follows after this list). Config files are copied from the templates directory during the build process if they do not exist yet, so you can modify the template config file too.
  • create processes directory, and place your process XML files there,
  • create repository directory, and place your block and entity xmls there, with optional dictionaries and word sets.
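As an example of the logging configuration mentioned above, a minimal log4j.properties might look like the following. This is a sketch using standard log4j 1.x syntax; the actual template shipped with app-seed may differ:

# root log level and appender
log4j.rootLogger=INFO, stdout
# write log messages to the console
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n

Changing INFO to DEBUG raises the log verbosity.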


Computational and data model

In Longneck, data is represented as a stream of records. A record is a set of key-value pairs called fields, and each key and value is stored as a string. This is because most data transformations are performed on strings. In case another type is required, type conversion is performed at the operation level. Each field may also have a number of predefined flags set on it, independent of its value. These flags can be checked or cleared when desired.

Illustration of a record: the record model used by Longneck.
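For example, a record read from a web server log might consist of string fields like the following (an illustrative sketch; the field names follow the weblog extension described later, and the values are made up):

    clientip:   50.134.188.22
    status:     302
    request:    GET / HTTP/1.1

Each value is stored as a string, and type conversion (e.g. treating status as a number) happens only inside the operation that needs it.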

A process is a sequence of transformations and checks to be performed on the data read from the input. As records traverse the system, the steps of the process are performed on them in a linear fashion. The steps can be categorized into two groups:

  • Checks, which are implemented by entities and constraints, impose - as the name implies - constraints on the fields, values and flags of the record.
  • Transformations, which are called blocks, apply a transformation to the data, like assigning a value to a field or cutting its length.

The example below shows a simplified version of the process in the demo application. The process begins by defining the source and target, and then goes on to describe the transformations performed.

<process xmlns="urn:hu.sztaki.ilab.longneck:1.0">
  <source>
    <weblog-file-source name="cli" />
  </source>
  <target>
    <csv-target target="out.csv"/>
  </target>
  <blocks>
    <!-- Parse log line -->
    <block-ref id="weblogparser:parse"/>
    <if>
      <is-null apply-to="requestUrlFull"/>
      <then>
        <!-- Copy a field value to another field. -->
        <copy apply-to="requestUrl" from="requestUrlFull" />
      </then>
    </if>
    <check summary="The request url must not be null.">
      <not-null apply-to="requestUrl"/>
    </check>
  </blocks>
</process>

Example of a process.

The process is made up of the following main parts:

  • The <source> element defines the data source: where and how the data is read.
  • <target> specifies where the processed stream of records is written. The process may also have an <error-target>, which provides the configuration for trace outputs.
  • The <blocks> element is the "body" of the process, and contains the transformation steps performed on the records.
  • The optional <test-cases> element allows the transformation blocks to be tested.

The available sources and targets are described in detail in the Sources and targets section. The blocks and control structures are introduced in the Process section.


Sources and targets

The source is where records come from. It could be anything from a text file to a relational database, or perhaps an application creating streams of data continuously. The only condition is that the data has to be converted into records. The current package provides implementations for CSV files, and databases through JDBC.

Each source describes how to access the data and how to convert it to records. This configuration is usually separated into two parts:

  • The structure of the data is described on the source element in the process file.
  • Access parameters, such as file paths, URLs and passwords are provided as configuration properties in one of the configuration files.

Targets define the way the application writes out the processed data. As with sources, a number of implementations are provided for the most typical setups. The simplest of all is <null-target>, which simply throws away the output. It is mostly useful for processes with side effects, like training models. The <console-target> is useful for debugging output.
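For example, during development the regular target can be swapped for a console target. A minimal sketch; the element is shown without attributes, which your version may or may not require:

<target>
  <console-target/>
</target>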

CSV source

The CSV source is defined with the <csv-source> element. It reads one or more CSV files from a path specified in the configuration. The path is set by the csvSource.NAME.path configuration property, where NAME is substituted with the value specified as the name attribute of the <csv-source> element. Multiple files may be specified, separated by the path separator character (for example, ":" on Linux).

This implementation can be customized to match your CSV dialect, such as the field separator character, quotation style and column names used. The following example shows a common CSV source definition with the name "myfile".

<csv-source name="myfile" delimiter="," has-headers="false" columns="A B C"/>

The source specifies a comma as the delimiter, has no header as its first row, and has 3 columns, which are named A, B and C respectively. The path to the data file can be specified by adding the following line to the configuration properties file:

csvSource.myfile.path=data/myfile.csv

The path may be absolute or relative to the application home directory. As an alternative, you can specify the property on the command line, like this:

$ bin/longneck-app -DcsvSource.myfile.path=data/myfile.csv -p example.process.xml

Database source

The application is known to work with PostgreSQL, MySQL and Oracle JDBC drivers, but it should work with any database.

The implementation for JDBC is <database-source>. Since data in an SQL database is already organized into records, the conversion is straightforward. Field names from the database record are mapped to the Longneck record "as is", and each value is converted to String. The following code example shows the basic usage of a database source:


          <source>
            <database-source connection-name="weblog_demo">
              <query>
                select distinct(visitor_hostname) as "host_name"
                from weblog_event_incoming where visitor_hostname is not null
              </query>
            </database-source>
          </source>
        

Example of a source definition.

The element defines two important things. The connection-name attribute selects a database connection, which is used to connect to the JDBC database. The <query> element provides the SQL query to get the input data.

Configuring database connections

JDBC connections are configured in the configuration files with the following properties:

database.connection.NAME.type=jdbc
database.connection.NAME.url=jdbc:oracle:thin:@localhost:1521:tdb
database.connection.NAME.user=somedb
database.connection.NAME.password=XXXX

The NAME part must be substituted with the value of the connection-name attribute. The same connection can also be used with database targets at the same time.
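For instance, the <database-source> example above uses connection-name="weblog_demo", so it would be matched by properties like the following. This is a sketch for a PostgreSQL database; the URL, user and password are placeholders:

database.connection.weblog_demo.type=jdbc
database.connection.weblog_demo.url=jdbc:postgresql://localhost:5432/weblog
database.connection.weblog_demo.user=somedb
database.connection.weblog_demo.password=XXXX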

Weblog file source

The longneck-weblog extension enables parsing standard webserver log files. These files are similar to CSV files, but have non-standard CSV content and require some special processing operations. See the weblog extension section, and the examples and the get started pages for more info.

CSV target

The CSV target is the counterpart of the CSV source, and lets you write records into a CSV file. It is defined by the <csv-target> element in the process file, and it can be customized much the same way as the csv-source. The following example shows the usage of a csv-target:

<csv-target target="data/myoutput.csv" delimiter="," empty-value="-" columns="a=A b=B c=C"/>

The above code tells Longneck to write the output records into the file data/myoutput.csv, separate field values with the comma character, use "-" for empty values, and rename field "a" to "A", "b" to "B" and "c" to "C".

Database target

JDBC targets can be written to by using the <database-target> element. Take a look at the code example below, which shows a database target in use:


          <target>
            <database-target connection-name="somedb">
              <truncate-query>
                  delete from typetest_out
              </truncate-query>
              <insert-query numeric-fields-to-convert="num1,num2,num3">
               insert into typetest_out
               ( string, num1, num2, num3, dte, tstmp )
               values
               ( :STRING, :NUM1, :NUM2, :NUM3,
                 to_date(:DTE, 'YYYY-MM-DD'),
                 to_timestamp(:TSTMP, 'YYYY-MM-DD HH24:MI:SS') )
              </insert-query>
            </database-target>
          </target>
        

This element also has a connection-name attribute, identical to that of the database source, which selects the connection used to access the database. It also defines two queries:

  • The optional <truncate-query> element contains an SQL statement intended to truncate the target table. It can be invoked conveniently with a command line switch when executing the process.
  • The mandatory <insert-query> element contains an INSERT query that is used to insert the data into the database table. The field values of the record are referred to by using their names as named parameters in the query.

By default, values in the insert query are bound as String values. This may lead to problems when the database does not convert the values to the format of the corresponding table column. For integer columns, the simplest way to solve this problem is to list the fields in the numeric-fields-to-convert attribute. It is a hint to the database accessor to cast the fields listed there to integers. You can also use type casting expressions and functions in the target SQL, e.g. insert into ... values ( ..., to_date(:date_string_field, 'YYYY-MM-DD'), ... ). Be careful though, because these expressions are mostly not portable, as they contain non-standard SQL functions.

Records are inserted in batches. The batch size is determined by the number of records in each queue item, which is set by sourceReader.bulkSize and defaults to 100. The parameter can be set in the properties files in the application's config folder. If a batch insert fails, the process falls back to individual inserts, so only the failed records are left out.
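For example, to lower the batch size, the following line can be added to one of the properties files in the config folder (a sketch; 50 is an arbitrary illustrative value):

sourceReader.bulkSize=50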



Data transformation

The process body contains the transformation steps that the process performs on the records. Steps fall into two groups: checks, which are implemented by constraints and entities; and transformations, which are called blocks.

Checks and blocks are grouped in a modular structure, which is easier and faster to maintain and to keep consistent. For example, suppose we have a process that cleans and standardizes client data, and we have implemented a block dealing with email addresses. When a new data source arrives, we only extend this email block with the special cases of the new data source.

Control structures

The control structures provide a way to control the flow of execution in the process.

If-then-else

The <if> element is the basic conditional control structure: it performs the checks in its condition section and, based on the result of these checks, executes the <then> or the <else> branch of the structure. The else branch is optional. The code below shows an example of an <if> structure:


          <if>
            <not-null apply-to="public_domain_type"/>
            <then>
              <replace-all apply-to="public_domain_type" regexp="\s(\S*)$" replacement=""/>
              <trim apply-to="public_domain_type"/>
            </then>
            <else>
              <set apply-to="public_domain_type" value="unknown"/>
            </else>
          </if>
        

The above code checks whether the public_domain_type field has a non-null value; if it does, it removes the trailing whitespace-separated token and trims the value, otherwise it sets the field to "unknown".

Switch

The <switch> is a multiple execution path control structure. The <switch> element contains a number of <case> elements, which are executed in the order they are defined. If a case executes successfully without errors, it concludes the switch structure, and no other cases are tried. In the case of the normal switch, it is not an error if all cases fail.

Cases are special elements: if any check error occurs inside a case, the changes made in the case are rolled back and the error is not propagated to higher levels. In effect, this means that if the case begins with a <check>, the case content is only executed if the check evaluates to true. If the checks evaluate to false, a check error is raised, and execution continues from the next case. Any changes made inside the case so far are rolled back, so they do not have any effect on the record. If no checks are performed in the case, it is always executed, and thus serves as the default case for the switch.

If all cases fail in the <switch> structure, then the record is unchanged, and the execution continues from the end of the switch.


          <switch>
            <!-- Page view events-->
            <case>
              <check summary="Status is 2XX or 3xx and the requested document is a pageview type.">
                <match apply-to="status" regexp="[23]\d{2}"/>
                <equals apply-to="requestUrlExtension" value="html" />
              </check>

              <set apply-to="eventGroup" value="web"/>
              <if>
                <match apply-to="status" regexp="2\d{2}"/>
                <then>
                  <set apply-to="event" value="pageview"/>
                </then>
                <else>
                  <set apply-to="event" value="redirect"/>
                </else>
              </if>
            </case>

            <!-- Error events -->
            <case>
              <check summary="Status is 4XX or 5XX.">
                <match apply-to="status" regexp="[45]\d{2}"/>
              </check>

              <set apply-to="eventGroup" value="web"/>
              <set apply-to="event" value="error"/>
            </case>
          </switch>
        

There are two very similar control structures available in Longneck.

  • The <switch-strict> element works the same way as the <switch>, but it causes a check error if no case is executed successfully.
  • The other similar one is <try-all>, which executes all cases regardless of the success or failure of each case. The effects of failed cases are rolled back; a sketch follows below.
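A minimal sketch of <try-all>, assuming it accepts the same <case> children as <switch>; the fields and checks used here are illustrative only:

<try-all>
  <case>
    <check summary="Referer is present.">
      <not-null apply-to="refererUrl"/>
    </check>
    <set apply-to="hasReferer" value="true"/>
  </case>
  <case>
    <check summary="User agent is present.">
      <not-null apply-to="userAgent"/>
    </check>
    <set apply-to="hasUserAgent" value="true"/>
  </case>
</try-all>

Both cases are attempted, so a record with both a referer and a user agent gets both fields set, while a failed case leaves the record untouched.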


Error reporting

Error reporting provides important feedback about the data transformation process. It plays a critical role in improving the quality of the process, and as a result, may also improve data quality. The goal is to provide easily understandable information about the overall success of the process, and at the same time provide detailed information about data errors, when necessary. This section provides details about error reporting and handling in Longneck.

Errors in Longneck come from a variety of sources, and each set of errors requires somewhat different reporting and handling.

  • Algorithmic and program errors come from the Longneck software itself. These errors should be addressed by fixing the application code.
  • Atomic constraint failures generate errors, when the value of a specific field does not conform to the requirements set by the constraint. The causes of these events are usually easy to explain by looking at the rules at the reported location.
  • Compound constraints can contain complex expressions, which may create a multitude of failures in their subexpressions, and may make the root cause of the failure difficult to grasp for a human reader.
  • Entity failures have even greater significance than compound constraint failures. They show that a group of fields, as a larger whole of information, has failed its requirements. These failures tell us about the overall success of the process; however, the explanation is even more difficult than in the case of compound constraints.

The further we get away from the basic value requirements, the more difficult it gets for the developer to understand the overall result of a process, and also to see through individual failures to trace the basic causes of a larger failure.

To address this problem, Longneck provides a way to add explanation to each high-level failure event, and stores constraint failures in a hierarchy.

Event hierarchy

Each record may generate a series of failure events during its processing. These events are structured into a cause-consequence hierarchy. The root causes are leaves of the tree, and each cause is connected to a consequential event.

Some failures do not have important consequences. For example, the first subexpression of an <or> structure may fail, and the <or> can still finish without failure if the next subexpression succeeds. These events are not logged, since their significance is minor.

The top-level events provide a good overview of the success of the transformation process, but omit the details that would make them hard to read. To access detailed information, the developer may examine each top-level failure event for its causes, eventually digging down toward the root cause of the problem. This way the level of detail can be adjusted to the level of concern.

Longneck provides a way to supplement high-level failure events with explanatory messages. A summary may be added to each constraint group, which will be included in the failure event log in case of a failure. Its purpose is to explain the nature of the checks implemented by the constraint group. These log messages are starting points for investigating a data quality problem; they limit the amount of detail on each structural level to avoid overwhelming the developer with too much information at once, while still providing meaningful clues.

The following example shows how to add an explanatory message to an input constraint.


          <block id="address.canonize" version="1.0">

            <input-constraints summary="Either zip code or bm code must be defined.">
              <not-empty apply-to="zip-code"/>
              <not-empty apply-to="bm-code"/>
            </input-constraints>

            <!-- ... -->
          </block>
        

The summary attribute contains an explanatory text that is easy to read.

Error reports

Error reports contain the following information:

  • class_name: the class name of the constraint;
  • field: the name of the field, which is tested;
  • value: the value of the field above;
  • details: additional information about constraint parameters;
  • document_url: the URL of the process or block file, which was executed;
  • document_row and document_column: the row and column in the above document;
  • check_result: the result of the check, true or false;
  • check_id: a unique identifier attached to the event; consists of a node id, a timestamp and a serial number to distinguish events that occurred within one second;
  • check_parent_id: id of another failure event, which was generated as a direct consequence of the current failure event;
  • check_tree_id: id of the failure tree to allow easy querying of each tree;
  • check_level: the level of the event in the process call tree, starting from zero at the root of the tree, incremented by 1 on each level toward the leaves.

Failure event keys must be assigned uniquely to each failure event. The keys should reflect the processing node (if run in a parallel environment), a timestamp and a serial, if more than 1 record is processed per millisecond.

The following SQL is a sample error table for PostgreSQL:

--
-- Sample error table for PostgreSQL
--
-- Note: field_value corresponds to the "value" field in the record.
--
create table sample_err
(
    class_name      varchar(500),
    field           varchar(500),
    field_value     varchar(2000),
    details         varchar(2000),
    document_url    varchar(2000),
    document_line   varchar(20),
    document_column varchar(20),
    check_result    varchar(5),
    check_id        numeric,
    check_parent_id numeric,
    check_tree_id   numeric,
    check_level     numeric
);

To use this table, an error target referring to it must be defined in the process file with the <error-target> element; failures will then be written into the table.

Performance considerations

If a process creates a lot of events during execution, it may cause the process to consume a large amount of memory, although the number of events cannot grow beyond the number of constraints and blocks that are capable of generating an event. The execution may also be slowed by repeated calls to “new”, which is a relatively slow operation.



Testing

Data transformation operations - blocks in Longneck - are designed using concrete test cases, where a given output is expected for a given input record. Longneck provides testing mechanisms to describe test cases, to verify block functionality, and to help keep transformation rules consistent and clean. These mechanisms are useful for different types of software testing, e.g. unit testing blocks, continuous integration or integration testing.

XML elements

When defining a process, we tell Longneck how to get input data (source), where to put output data (target), what transformation steps are performed (blocks) and where to write data errors (error-target). With the optional test-cases element we instruct Longneck to test the blocks against given test cases, to verify that the blocks really do what we expect.

The Longneck XML language provides the optional <test-cases> element with <test> children, each containing one source record and as many target and error target records as expected. A timeout threshold can also be set for a test case.

A test passes if the observed target records match the expected ones, and the expected error records are a subset of the observed ones. An expected record matches an observed one if the observed record has the expected fields with the expected values (but it may contain other fields too).

Command line parameters

The command line parameter -s or --testingBehavior instructs Longneck how to check the test cases defined in process XMLs and how to handle test failures. The default behaviour (normal) works the following way:

  • Check all test cases defined before getting any records from the data source.
  • If a test fails, exit with an erroneous status code.
  • If all tests are successful, begin reading and transforming records of the source.

With the “skip” option, Longneck does not handle the test cases at all. With the “tolerant” option, the tests are checked and failures are logged, but failures have no further consequences; the process is executed after the tests.

The command line parameter -v or --verbose instructs Longneck to run tests verbosely. In this case all target and error-target records of all test cases are printed to the standard output and also logged at DEBUG level.
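For example, to check the tests, log any failures without aborting, and print the test records while still running the process afterwards (a sketch using the app-seed launcher):

$ bin/app-seed -p processes/your-process.xml -s tolerant -v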

Example

<test-cases>
    <test id="test01" summary="test target" >
      <record role="source">
        <field name="logLine" value="50.134.188.22 - - [28/Mar/2013:12:18:32 +0100] "GET / HTTP/1.1" 302 211" />
      </record>
      <record role="target">
        <field name="virtualhost" value="example.com"/>
        <field name="time" value="2013-03-28 12:18:32" />
        <field name="eventGroup" value="web" />
        <field name="event" value="redirect" />
        <field name="clientip" value="50.134.188.22" />
        <field name="domainName" value="c-50-134-188-22.hsd1.co.comcast.net" />
        <field name="request" value="GET / HTTP/1.1" />
        <field name="status" value="302" />
        <field name="requestProtType" value="HTTP/1.1" />
        <field name="requestProtMethod" value="GET" />
        <field name="requestUrl" value="http://example.com/" />
        <field name="requestUrlExtension" value=""/>
        <field name="user" value="-" />
        <field name="bytesSent" value="211" />
      </record>
    </test>
  </test-cases>


Command-line usage

This section provides an overview of the command line features of the Longneck executable class provided in longneck-core. We refer here to the Longneck app seed project, which provides three executable scripts to run the application. To see the basic command line options, run the application with the --help switch.

$ bin/app-seed --help

This will show the help screen.

usage: app-seed <OPTIONS>

Longneck data transformation tool.
 -D,--define <arg>               Define runtime property <name>.
 -E,--errorTruncateBeforeWrite   Truncate the error table before
                                 processing records.
 -h,--help                       Prints this help screen.
 -l,--maxErrorEventLevel <arg>   The maximum level of errors written by
                                 the error writer.
 -m,--measureTimeEnabled         Enables time measurement on threads.
 -p,--processFile <arg>          Specifes the process file URL.
 -s,--testingBehavior <arg>      Define how to handle test cases: normal
                                 (default), alone, skip, tolerant
 -T,--truncateBeforeWrite        Truncate the target datastore before
                                 processing records.
 -t,--workerThreadsNum <arg>     Number of worker threads on which the
                                 process is running. Default: 1
 -v,--verbose                    Verbose testing
 -X,--executeUtility <arg>       Execute built-in utility <name> instead
                                 of running a process.

Web: http://longneck.sztaki.hu/
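As an example, the following invocation (a sketch; the property and file names are illustrative) truncates the target and the error table before writing, overrides a configuration property and runs the process on four worker threads:

$ bin/app-seed -p processes/your-process.xml -T -E -t 4 -DcsvSource.myfile.path=data/myfile.csv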


Content repository

The content repository contains transformation rule sets for common data integration problems. The main strength of Longneck is its modular and flexible data processing descriptions: reusable blocks, entities and processes enable fast data integration process implementation with fewer errors.

We set up a repository containing some basic entities and blocks. These are organized according to business domain and localization, and can be used to start a new application.

The directory path of entities and blocks consists of the following elements:

[domain/subdomain]/[language_code|country_code]/NAME.[language_code|country_code].[block|entity].xml
  • [domain/subdomain] is an optional two-level hierarchy of the application domain; e.g. the following:
    • inet/web, inet/email: blocks and entities for Internet data, URLs, email addresses, IP addresses etc.
    • identity: transformation rules for identities, e.g. personal names, birth dates,
    • datetime: date and time entity transformations (formatting and parsing),
    • it-log: IT (audit) log entities and blocks, e.g. application events, user names,
    • phone: phone number transformation,
    • location: geo-location processing (e.g. coordinates) and postal addresses.
  • [language_code|country_code]: optional elements for localization; two-letter country or language codes (ISO 3166-1 alpha-2 and ISO 639-1) are provided here, if the blocks/entities are language/country dependent. In this case the codes are also attached to the file name.
  • [block|entity] identifies the type of the given XML description,
  • NAME is the name of the given entity or block.

Dictionaries and word sets are stored alongside blocks and entities:

[domain/subdomain]/[dictionaries|wordsets]/NAME
  • NAME: dictionaries are XML files ending with .xml, wordsets are arbitrary files.
  • They are referred to by entities and blocks using this NAME.

Internal references in blocks and entities are relative. This way all these references can be followed inside the repository, and files can be copied without updating the references. Longneck provides methods to deal with block and entity versions, but for transformation rules we don't use versions - we let Git store the history of the entities and blocks.

All directories contain a README.md file with the documentation of the given transformation rules. This way, the GitHub repository can be used to browse the content repository documentation. The XML files of the repository are self-documenting, containing meaningful XML comments.



Extensions

Longneck-weblog

The Longneck-weblog extension provides a data source for parsing and pre-processing web server log files. The main goal is to enable efficient web analytics based on the log files, translating HTTP request events into business events with good quality attributes.

The content repository provides web analytics examples based on the parser, while the examples section describes a web log processing demo application.

Longneck-weblog
  • interprets Apache log files according to the Apache log definition,
  • permits special fields,
  • and constructs useful derived fields not included in the original log (e.g. the request URL, or URLs decoded from percent encoding).

We only have to configure the parser, describing the file format in config/weblog-config.xml; then the data source and the line parser can be included in our process definition:

<source>
  <weblog-file-source name="cli"/>
</source>

<blocks>
  <weblog-line apply-to="logLine"/>
  ...
</blocks>

Here, the weblog-file-source element defines the weblog file reader, responsible for producing a single field per line, called logLine. The weblog-line element parses this line and produces the fields containing the data elements of the log line, according to config/weblog-config.xml.

Log file formats and fields

Longneck-weblog follows the log format definition of Apache; that way standard web server log files can be easily defined and used as data sources. Apache parameters map to the following Longneck fields:

  • %v --> virtualhost: virtual host name of the server answering the request
  • %h --> clientip: remote hostname or ip address
  • %u --> user: remote user if authenticated
  • %l --> identity: remote logname supplied by identd
  • %t --> time: time
  • %r --> request: first line of request
  • %s --> status: final status code (not counting 100 CONTINUE)
  • %b --> bytesSent: bytes sent, including headers
  • %{Referer}i --> refererUrl: referer URL header field of the request
  • %{User-Agent}i --> userAgent: user agent header field of the request
  • %{Cookie}i --> requestCookie: cookie header received in the request
  • %{Cookie}n --> responseCookie: cookie header sent in the response

See http://httpd.apache.org/docs/current/mod/mod_log_config.html#formats for more details on Apache log formats.

The request and virtualhost fields are handled together; after processing the request and virtualhost fields, we get the requestProtMethod, requestProtType and requestUrl fields. For example:

Input:
    virtualhost:         www.example.com
    request:             POST /mysite/index.php?id=123&value=abc HTTP/1.1
Output:
    virtualhost:         www.example.com
    request:             POST /mysite/index.php?id=123&value=abc HTTP/1.1
    requestUrl:          http://www.example.com/mysite/index.php?id=123&value=abc
    requestProtMethod:   POST
    requestProtType:     HTTP/1.1

The requestUrl field is split, resulting in requestUrlFull, requestUrlParameter, requestUrlExtension and requestUrlParameter-{parametername}. Each URL parameter is split into its own field, unless turned off in weblog-config.xml with <create-url-parameters>false</create-url-parameters> (the default is true). For example:

Input:
    requestUrl:          http://search.private:8080/search.go?appid=xsearch&xsearchQ=%22cats%22
Output:
    requestUrlFull:      http://search.private:8080/search.go?appid=xsearch&xsearchQ="cats"
    requestUrlParameter: appid=xsearch&xsearchQ="cats"
    requestUrlExtension: go
    requestUrlParameter-appid: xsearch
    requestUrlParameter-xsearchQ: "cats"

If a secondary url-decoding-charset is given in config/weblog-config.xml, then we also get the fields decoded using this secondary encoding, named requestUrlFull2, requestUrlParameter2 and requestUrlParameter2-{parametername}. The default primary url-decoding-charset is UTF-8: <url-decoding-charset>UTF-8</url-decoding-charset>. By default there is no secondary encoding, so no secondary fields are produced.

The goal of the alternative URL decoding is purely technical: we found that in many cases different levels of web services use different encodings, resulting in merged frontend logs with URLs using mixed encodings. In these cases identifying the correct encoding is not possible; the best we can do is decode the URLs in alternative ways and let the Longneck user choose the appropriate value for the given URL parameter.

Parser configuration

The mapping between Apache and Longneck field names is defined in logdefinition.xml of longneck-weblog, along with the regular expressions required by the parser to identify the given field. These fields are used in the parser configuration file config/weblog-config.xml of the application. For example, the following config file implements a standard Common Log Format parser:

<?xml version="1.0" encoding="UTF-8"?>
<log-format xmlns="urn:parser"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <log-config>%h %l %u %t \"%r\" %>s %b</log-config>
</log-format>

Additional custom fields may also be introduced by adding a log config field name, a Longneck field name and a regular expression. The secondary URL encoding and the setting of whether URL parameter fields should be created can also be defined here; for example:

<log-config>%{loglevel} - %h %l %u %t \"%r\" %>s %b</log-config>

<url-decoding-charset2>ISO-8859-1</url-decoding-charset2>
<create-url-parameters>true</create-url-parameters>

<log-element>
    <type>%{loglevel}</type>
    <name>loglevel</name>
    <regex>(\S+)</regex>
</log-element>

Here we introduce the loglevel field, which starts every log line, and also instruct the parser to decode URL parameters according to a secondary encoding and to construct URL parameter fields.

Check Digit Verification (CDV)

A check digit is a form of redundancy check used for error detection on identification numbers (e.g. bank account numbers). Using the CDV extension, constraints can be defined based on check digit rules. Only the coefficients of the positions and the modulus have to be provided for the linear combination rule. If the combination is zero, the constraint evaluates to true; otherwise it fails.

The following example validates Hungarian social security numbers:

<check summary="SSN CDV">
  <cdv apply-to="social-security-number" coefficients="3 7 3 7 3 7 3 7 -1" mod="10"/>
</check>
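Assuming the coefficients are applied to the digits of the field value in order, this constraint holds exactly when (3·d1 + 7·d2 + 3·d3 + 7·d4 + 3·d5 + 7·d6 + 3·d7 + 7·d8 − d9) mod 10 = 0, where d1…d9 are the nine digits; in other words, the ninth digit must equal the weighted sum of the first eight digits modulo 10.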


Dictionary lookup