RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.
Hadoop’s HDFS is often used to store large amounts of text-formatted data: log files, sensor readings, transaction histories, etc. Much of this data is “near-structured”: the data has a format that’s obvious to a human observer, but is not made explicit in the file itself.
This looks like it would have potential for identifying and cataloging data files inside Hadoop.
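To make "near-structured" concrete, here is a minimal Python sketch of the manual work that RecordBreaker aims to eliminate: a hand-written parser for one log format, plus an Avro-style record schema describing its output. The log line, the regular expression, the `to_record` helper, and the `LogLine` schema name are all invented for illustration; none of this is RecordBreaker's API or its inference method.

```python
import re

# Hypothetical log line. A human can see the layout at a glance
# (date, time, level, message), but nothing in the file declares it.
LINE = "2012-03-14 09:26:53 WARN disk usage at 91 percent"

# Hand-written extractor for this one format (an assumption for this sketch).
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def to_record(line):
    """Return the line's fields as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None

# An Avro record schema, written as a Python dict, that makes the
# implicit structure explicit. Every field is kept as a string for simplicity.
SCHEMA = {
    "type": "record",
    "name": "LogLine",
    "fields": [
        {"name": name, "type": "string"}
        for name in ("date", "time", "level", "message")
    ],
}

record = to_record(LINE)
```

Every new log format needs its own pattern and schema when done by hand like this, which is the preparation time the project description says it wants to cut.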
