Track source for reference files

Vermaat requested to merge reference-source into master

Previously, the original source for a reference file was implicit:

  1. If accession number starts with LRG_, it came from the LRG FTP archive.
  2. If a download URL is known, it was downloaded from there.
  3. If slice data is known, it was sliced from the NCBI.
  4. If a GI number is known, it was downloaded from the NCBI.
  5. Otherwise, it was uploaded.
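
The precedence above can be sketched as a small function, which is roughly what a migration would need to apply to populate an explicit source per row. This is only an illustration: the `LegacyReference` class, its field names, and the source labels (`lrg`, `url`, `ncbi_slice`, `ncbi`, `upload`) are assumptions made for this example and are not taken from the actual model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LegacyReference:
    # Hypothetical stand-in for the old columns; names are assumed.
    accession: str
    download_url: Optional[str] = None
    slice_accession: Optional[str] = None
    geninfo_identifier: Optional[str] = None


def infer_source(ref: LegacyReference) -> str:
    """Apply the five implicit rules in order; the first match wins."""
    if ref.accession.startswith("LRG_"):
        return "lrg"
    if ref.download_url:
        return "url"
    if ref.slice_accession:
        return "ncbi_slice"
    if ref.geninfo_identifier:
        return "ncbi"
    return "upload"
```

Because the rules are ordered, a record with both slice data and a GI number resolves to the slice rule. That ordering dependency is the kind of implicit behaviour an explicit field makes unnecessary.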

In preparation for the removal of GI numbers (#349 (closed)), this had to be revisited. We now store the source explicitly in a new source field on the Reference model. If additional information is needed to re-fetch the file from this source (e.g., download URL), this is stored in a new source_data field (always serialized as a string). This scheme should be both more explicit and more generic.
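
As a rough sketch of how a consumer such as the retriever could dispatch on the explicit field instead of probing several columns: the `Reference` dataclass below, the source labels, the JSON encoding of slice data, and the `refetch_plan` helper are all assumptions made for illustration, not the actual implementation.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    # Hypothetical, simplified model; only the fields relevant here.
    accession: str
    source: str
    source_data: Optional[str] = None  # always a string when present


def refetch_plan(ref: Reference) -> dict:
    """Describe how to re-fetch a reference file from its source."""
    if ref.source == "url":
        # source_data holds the download URL as-is.
        return {"action": "download", "url": ref.source_data}
    if ref.source == "ncbi_slice":
        # Assumed encoding: slice parameters serialized as a JSON string.
        return {"action": "slice", **json.loads(ref.source_data)}
    if ref.source in ("ncbi", "lrg"):
        # The accession alone is enough; no source_data needed.
        return {"action": "fetch", "accession": ref.accession}
    if ref.source == "upload":
        # Uploaded files cannot be re-fetched.
        return {"action": "none"}
    raise ValueError(f"unknown source: {ref.source!r}")
```

Adding a new kind of source then means adding one label and, if needed, deciding how its `source_data` string is encoded, without new columns.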

Subtasks:

  • Add source and source_data columns.
  • Populate columns in migration.
  • Load some example data for migration tests.
  • Use the columns in the retriever, remove use of old columns.
  • Use the columns in cache sync, remove use of old columns.
  • Check use of old columns elsewhere.
  • Follow-up: remove slice_* and download_url columns and make source NOT NULL. #388 (closed) #389 (closed)
