validate

What

The purpose of the validate command is to check that a AncientMetagenomeDir metadata file confirms the specifications of the project.

When

In most cases, validate is run for you via automatic Pull Request checks on the AncientMetagenomeDir GitHub repository.

However in some cases, you may wish to run this manually yourself before opening your pull request.

How

Before you run validate, you normally need a local copy of AncientMetagenomeDir, but primarily the AncientMetagenomeDir .tsv file, and the corresponding .json schema file.

To run, execute the following command:

AMDirT validate ancientmetagenome-hostassociated_samples.tsv ancientmetagenome-hostassociated_samples_schema.json

With the .tsv file in the first positional argument, and the .json schema in the second position.

validate can be run on both sample and library tables.

By default, no schema validation checks will be performed however it will check the files can be loaded as expected.

The tool offers different types of validations that are activated by options.

  • schema validation (-s): compares the TSV against the schema

  • duplicate line checking (-d): checks there are no duplicate lines

  • column presence/absence checking (-c): checks that all expected columns are present

  • doi duplication checking (-i): checks there are not the same DOI across two different publications

  • ENA accession validation (-a): checks the ENA accession code is valid and corresponds to the PRJ associated code

  • multi-value column checking: allows you to check there are no duplicates within a comma separate column with multiple values (e.g. sample accessions)

You can also change the output console formatting from a python-rich format (default) to a markdown formatting with (-m). The markdown formatting is primarily for the AncientMetagenomeDir automatic checks on GitHub.

For example, to check all columns are present, there are no duplicate lines, and the table matches the schema you can run

AMDirT validate -s -d ancientmetagenome-hostassociated_samples.tsv ancientmetagenome-hostassociated_samples_schema.json

If the table is valid you will get a corresponding message as follows:

2022-12-16 11:15:41.109 INFO    AMDirT: ancientmetagenome-hostassociated_samples.tsv is valid

If the table is invalid, you will get a table containing a list of the found errors:

           AMDirT Validation Report of ancientmetagenome-hostassociated_samples.tsv against ancientmetagenome-hostassociated_samples_schema.json
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Error                   ┃ Source     ┃ Column      ┃ Row ┃ Message                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Schema Validation Error │ Homo saens │ sample_host │ 3   │ 'Homo saens' is not an accepted value.                                                        │
│                         │            │             │     │ Please check                                                                                  │
│                         │            │             │     │ https://spaam-community.github.io/AncientMetagenomeDir/assets/enums/sample_host.json          │
│ Schema Validation Error │ -1243.386  │ latitude    │ 5   │ -1243.386 is less than the minimum of -90                                                     │
└─────────────────────────┴────────────┴─────────────┴─────┴───────────────────────────────────────────────────────────────────────────────────────────────┘
Invalid dataset ancientmetagenome-hostassociated_samples.tsv

Where Source is the problematic cell entry, the Column and Row where the error is found, and the reason why it is problematic in the Message.

⚠️ The row index includes the header! For example row 3 correpsonds to the the second sample row in the table.

You can then use this information to correct your table before opening your GitHub pull request to the AncientMetagenomeDir repository.