There are three main kinds of web traffic nowadays:
- from users
- from normal robots (googlebot, msnbot, etc.)
- from abnormal web crawlers

We should respond to users quickly and with high priority, give normal robots the right response at a lower priority, and refuse to serve abnormal web crawlers.
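
To make the policy concrete, here is a minimal sketch in Python; the class labels and priority values are hypothetical, not taken from any existing system:

```python
# Minimal sketch of the serving policy above.
# Class labels and priority values are hypothetical.

def serving_decision(traffic_class: str) -> tuple[bool, int]:
    """Return (should_serve, priority); a lower number means higher priority."""
    if traffic_class == "user":
        return True, 0    # quick response, highest priority
    if traffic_class == "normal_robot":
        return True, 1    # correct response, lower priority
    return False, -1      # abnormal crawler: reject service
```
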
---
Currently, we detect robots by implicit human browsing behavior: a piece of JavaScript is dynamically embedded into the pages served to the client, together with an event handler for mouse movement and key presses. Robots and crawlers do not execute the JavaScript, so they never trigger the handler. However, some people disable JavaScript in their browsers, and for this and other reasons the method can fail.
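
A minimal sketch of this mechanism, assuming a Flask server (the `/beacon` path, the page body, and the in-memory set are invented for illustration): only clients that actually execute the embedded script ever reach the beacon endpoint.

```python
# Sketch of the implicit-behavior check, assuming Flask.
# The /beacon path and page body are hypothetical.
from flask import Flask, request

app = Flask(__name__)
likely_humans = set()  # IPs whose client executed the script

PAGE = """<html><body>Hello.
<script>
  // Fire a one-time beacon on the first mouse move or key press.
  function mark() {
    navigator.sendBeacon('/beacon');
    document.removeEventListener('mousemove', mark);
    document.removeEventListener('keydown', mark);
  }
  document.addEventListener('mousemove', mark);
  document.addEventListener('keydown', mark);
</script></body></html>"""

@app.route("/")
def page():
    return PAGE

@app.route("/beacon", methods=["POST"])
def beacon():
    likely_humans.add(request.remote_addr)  # executed JS: likely a human
    return "", 204

if __name__ == "__main__":
    app.run()
```

Clients with JavaScript disabled never hit `/beacon`, which is exactly the failure mode noted above.
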
---
We want to analyse the web server access log for a more accurate result.
Our goal is to distinguish human users from robots by analysing the access log.
That means:
- Input: every single record r belonging to a set of records R.
- Output: the record with a tag.
Example:
- a single input record with many fields:
- 23 2013 28 59 103.0 103.0 xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx XXX.XXX.com GET /abc/def/ghi 200 295 http://abc.def.ghi.com Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1 0.03 - xxx.xx.xxx.xxx - 59 06 1482 80
- the corresponding output with a tag:
- (xxx.xxx.xxx.xxx, human)
- or (xxx.xxx.xxx.xxx, robot)
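
The exact field layout of this log format is not specified here, so the sketch below only illustrates the record-to-tag contract: it pulls the first IPv4-looking token out of a record and applies a deliberately naive placeholder rule (user-agents containing "bot" or "spider"); a real classifier would replace `classify`.

```python
# Sketch of the record -> (ip, tag) contract. Field positions are
# unknown here, so the IP is found by pattern and the rule is a
# placeholder, not a real detector.
import re

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def classify(record: str) -> str:
    ua = record.lower()
    return "robot" if ("bot" in ua or "spider" in ua) else "human"

def tag(record: str) -> tuple[str, str]:
    match = IP_RE.search(record)
    ip = match.group(0) if match else "unknown"
    return ip, classify(record)

# Usage over a whole log:
# for line in open("access.log"):
#     print(tag(line))
```
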
There are three main methods (paper: "Web robot detection techniques: overview and limitations"); a sketch of the traffic-metrics approach follows the list:
- syntactic log analysis
  - individual field parsing
  - user-agent mapping
  - multifaceted log analysis
- traffic pattern analysis
  - syntactic and pattern analysis
  - resource request patterns
  - query rate patterns
  - traffic metrics
- analytical learning
  - decision trees
  - neural networks
  - Bayesian networks
  - hidden Markov models
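
As a rough sketch of the traffic-metrics and analytical-learning directions, the snippet below applies a hand-written, decision-tree-style rule to per-IP features; every feature name and threshold here is invented for illustration, and a real system would learn them (decision tree, Bayesian network, HMM, ...).

```python
# Sketch: per-IP traffic metrics fed to a hand-written,
# decision-tree-style rule. All thresholds are invented; in
# practice they would be learned from labelled logs.
from dataclasses import dataclass

@dataclass
class TrafficMetrics:
    requests_per_minute: float  # query rate pattern
    hit_robots_txt: bool        # resource request pattern
    head_ratio: float           # share of HEAD requests
    image_ratio: float          # browsers usually fetch many images

def classify(m: TrafficMetrics) -> str:
    if m.hit_robots_txt:              # well-behaved crawlers fetch it
        return "robot"
    if m.requests_per_minute > 120:   # far faster than human browsing
        return "robot"
    if m.head_ratio > 0.5:            # humans rarely issue HEAD
        return "robot"
    if m.image_ratio < 0.05:          # text-only fetch pattern
        return "robot"
    return "human"

print(classify(TrafficMetrics(3.0, False, 0.0, 0.4)))  # -> human
```
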
- [English Version](doc.md)
- [Chinese Version](doc_cn.md)