janhelke/sys-file-indexer

sys-file-indexer

A custom parallel file indexer and hasher

ABOUT

sys-file-indexer indexes the directory specified as the last argument, or the current directory by default.

sys-file-indexer always outputs the result to stdout.

MODES OF OPERATION

sys-file-indexer has the following modes of operation:

  1. Normal mode: outputs a combined CSV file that holds the data for both target datasets (sys_file and sys_file_metadata) and does not contain unique IDs. This output must be processed further by split mode to be useful. No options are necessary.

    Normal mode can benefit from a previous run if data is supplied with the -delta option. In this case, sys-file-indexer uses the data generated by a previous run whenever the modification time of a file has not changed.

  2. Split mode: takes the CSV file produced by normal mode as input and generates the CSV for either the sys_file dataset or the sys_file_metadata dataset. See options -ofile and -ometa.

  3. SQL mode: outputs readily usable SQL INSERT statements that can be piped directly to the database.

  4. Single mode: outputs one single CSV dataset. Useful for testing only.

EXAMPLE

Generate the normal mode CSV output:

$ sys-file-indexer >../normal.csv

Update a previously generated normal mode CSV:

$ sys-file-indexer -delta=../normal.csv >../new-normal.csv

Split normal mode CSV to generate two datasets:

$ sys-file-indexer -ofile=normal.csv >sys_file.csv
$ sys-file-indexer -ometa=normal.csv >sys_file_metadata.csv

Generate SQL and load it directly into the database (-delta cannot be used in this mode):

$ sys-file-indexer -sql | mysql ...

PARTITIONING

sys-file-indexer can be run on multiple machines in parallel if that increases I/O throughput. Each host produces a partial result, and the partial results are concatenated afterwards:

host1$ sys-file-indexer -w 1 -wg 3 ... > result1.csv
host2$ sys-file-indexer -w 2 -wg 3 ... > result2.csv
host3$ sys-file-indexer -w 3 -wg 3 ... > result3.csv
host1$ cat result1.csv result2.csv result3.csv > result.csv
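One common way to split a file tree across hosts so that every file is handled exactly once is to hash each path and take the result modulo the group size. The sketch below illustrates that idea only. It assumes that -w is a 1-based worker number and -wg the number of workers in the group, as the example above suggests, and the `ownedBy` helper and the FNV hash are choices made for this example. How sys-file-indexer actually partitions its work is not stated here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownedBy reports whether the given worker (1-based) out of group workers is
// responsible for path. Every path maps to exactly one worker because the
// hash of the path is deterministic. This scheme is an assumption, not the
// tool's documented behaviour.
func ownedBy(path string, worker, group int) bool {
	h := fnv.New32a()
	h.Write([]byte(path)) // Write on a hash.Hash never returns an error
	return int(h.Sum32()%uint32(group))+1 == worker
}

func main() {
	paths := []string{"a/one.txt", "a/two.txt", "b/three.txt", "b/four.txt"}
	const group = 3
	for w := 1; w <= group; w++ {
		for _, p := range paths {
			if ownedBy(p, w, group) {
				fmt.Printf("worker %d of %d handles %s\n", w, group, p)
			}
		}
	}
}
```

Because the assignment depends only on the path and the group size, the hosts need no coordination, and concatenating their outputs yields each file once.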

TODO

  • Scan multiple directories
