Parse Table Data

Description

Parses tabular data that typically has been pasted or saved from a spreadsheet program.

The program is intended to reliably parse any of the following:

  • tab-delimited data pasted from a spreadsheet
  • Data pasted from a comma-separated file
  • Data pasted from a tab-separated file

tab-delimited data pasted from a spreadsheet can contain ambiguities. The most reliable input form is csv that conforms to RFC 4180

The service can automatically detect what field delimiters and quote characters are in use:

  • If the first ten non-blank lines contain five tabs or more, or at least as many tabs than commas, or no commas, tab-delimited is assumed. Otherwise comma-delimited is assumed
  • Double quotes are assumed as the quote character.

Solutions should only use auto behaviour to interpret unknown user data, and should ideally confirm with the user the interpretation of their data. Solutions should not rely on auto detection behaviour, and should specify delimiter and quote if these are known.

Input

<ParseTableData>
  <data/>
  <delimiter/>
  <quote/>
  <output>raw|xml|json|csv</output>
  <keepBlankLines>false|true</keepBlankLines>
  <comment/>
</ParseTableData>

data is the data to be parsed.

delimiter is the delimiter character to use. It can be:

  • auto - automatically detect (default)
  • none - no delimiter, only whole lines are returned
  • tab - tab
  • comma - comma
  • space - space character
  • any other single character

quote is the quote character. It can be:

  • auto - automatically detect
  • none - no quote character
  • any other single character

output indicates what output format should be used:

  • raw - data should contain multiple <row> elements, each of which has multiple <field> elements containing the individual items of data. This is the default.
  • xml - the first line in the data should be interpreted as field names. The data should contain multiple <row> elements, and then multiple elements named the same as the field names.

    Any characters in the field name that are not valid for an XML name are converted to _xNNNN_, where NNNN is the hexadecimal representation of the character (see ISO 9075-14).

    If there are more data field than names in the first row, names of columnX will be used, where X is the ordinal position.

  • json - the first line in the data should be interpreted as field names, and then the data returned as a JSON array of objects. All values are returned as string. All missing value are omitted.
  • csv - data is comma-separated values.

keepBlankLines indicates what should happen to blank lines (or lines containing only delimiters):

  • false - ignore blank lines. This is the default.
  • true - process blank lines

comment specifies a sequence of characters which, if they are at the start of a line, indicates that this is a comment line. The default is "", which means no comment characters.

Example conversions:

Input data

Species,Name,"Keeper's Name"
Dog,Fido,"Jane Doe"
Salamander,Slippy,Fred
"Sea Monkey",,"John Doe",44

raw output

<data>
  <row>
    <field>Species</field>
    <field>Name</field>
    <field>Keeper's Name</field>
  </row>
  <row>
    <field>Dog</field>
    <field>Fido</field>
    <field>Jane Doe</field>
  </row>
  <row>
    <field>Salamander</field>
    <field>Slippy</field>
    <field>Fred</field>
  </row>
  <row>
    <field>Sea Monkey</field>
    <field/>
    <field>John Doe</field>
    <field>44</field>
  </row>
</data>

xml output

<data>
  <row>
    <Species>Dog</Species>
    <Name>Fido</Name>
    <Keeper_x0027_s_x0020_Name>Jane Doe</Keeper_x0027_s_x0020_Name>
  </row>
  <row>
    <Species>Salamander</Species>
    <Name>Slippy</Name>
    <Keeper_x0027_s_x0020_Name>Fred</Keeper_x0027_s_x0020_Name>
  </row>
  <row>
    <Species>Sea Monkey</Species>
    <Keeper_x0027_s_x0020_Name>John Doe</Keeper_x0027_s_x0020_Name>
    <column4>44</column4>
  </row>
</data>

json output

[
  {
    "Species":"Dog",
    "Name":"Fido",
    "Keeper's Name":"Jane Doe"
  },
  {
    "Species":"Salamander",
    "Name":"Slippy",
    "Keeper's Name":"Fred"
  },
  {
    "Species":"Sea Monkey",
    "Keeper's Name":"John Doe",
    "column4":"44"
  }
]

Output

<TableData>
<errorNumber>0</errorNumber> <data/> <columns> <column/> xml format (repeats) ["column1","column2",...] json format </columns> <delimiter/> <quote/> </TableData>

The data is as described above.

columns is only output for xml and json types. It lists all the columns used, including generated column names.

delimiter and quote return the delimiter and quote in use, either passed in or calculated. Delimiters of none, tab and space are returned as words, other delimiters (including comma) as single characters. Quotes of none are returned in words, other delimiters as single characters.

Assuming there are no parameter errors, the service parses the data as best it can and returns an error number 0. It never returns an error because it can not interpret the data, and never returns invalid data.

Errors

103 - Parameter error

Class

com.metrici.xerula.ParseTableDataService