Commit 0fdb35c5 authored by npappas

include md5sums for downloaded files

parent c55d8e63
......@@ -6,12 +6,58 @@ Data
Downloaded on 17/10/2018.
Contains the taxdump files (nodes.dmp and names.dmp are necessary).
md5sums
```
12c454dfb401eeb4a1c228656763d5ab taxdump.tar.gz
a1794f34a992e2654cf504bd5087f13a names.dmp
1c7533e815153d65ded8ae12d4dfa927 nodes.dmp
#ete3_db created from this taxdump
b166513f7facfa27e1c90c4d233c82cc taxa.sqlite
5e7b8a9b5d2f649a8a5ce5a10464888f taxa.sqlite.traverse.pkl
```
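The checksums above can be used to confirm that a local copy matches the files this database was built from. The following is a minimal sketch, not part of the repository: the helper names (`md5_of`, `parse_md5_listing`, `verify`) are hypothetical, and it assumes the files sit in the current working directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the listing above; '#' lines are treated as comments.
LISTING = """\
12c454dfb401eeb4a1c228656763d5ab taxdump.tar.gz
a1794f34a992e2654cf504bd5087f13a names.dmp
1c7533e815153d65ded8ae12d4dfa927 nodes.dmp
#ete3_db created from this taxdump
b166513f7facfa27e1c90c4d233c82cc taxa.sqlite
5e7b8a9b5d2f649a8a5ce5a10464888f taxa.sqlite.traverse.pkl
"""


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse '<md5> <filename>' lines, skipping blank and '#' lines."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        checksum, name = line.split(maxsplit=1)
        expected[name] = checksum
    return expected


def verify(expected, directory="."):
    """Map each filename to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, status in verify(parse_md5_listing(LISTING)).items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{name}: {label}")
```

Reading in chunks keeps memory use flat for large files such as `taxa.sqlite`.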
# external_sets
Query used
* NW and NT redundant accessions
1. Run on 17/10/2018
2. Query
https://www.ncbi.nlm.nih.gov/nuccore/?term=(NT_000001%3ANT_999999%5Bpacc%5D+OR+NW_000001%3ANW_999999%5Bpacc%5D)+AND+nuccore_comp_nuccore%5Bfilter%5D
Manually download the summary returned from NCBI.
```
grep -e "^NW_" -e "^NT_" nuccore_result.txt | cut -f1 -d' ' > redundant_accessions.txt
```
3. Get the content of the result
```
python plot_nuccore_summary.py nuccore_result.txt
```
Thus, the accessions can be used for filtering of the following domains:
- plant
- vertebrate_mammalian
- vertebrate_other
- invertebrate
* bacteria_rep
1. Run on 25/10/2018
2. Query
https://www.ncbi.nlm.nih.gov/assembly/?term=prokaryotes%5BOrganism%5D+AND+%22latest+refseq%22%5Bfilter%5D+AND+(%22representative+genome%22%5Bfilter%5D+OR+%22reference+genome%22%5Bfilter%5D)
3. From the result page: `Download Assemblies` -> `Assembly structure report` -> genome_assemblies.tar
4. Extract the list of accessions to a file
```
tar -xvf genome_assemblies.tar
find ./ncbi-genomes-2018-10-25 -name "GCF_*" -exec sh -c 'grep -v "^#" {}| cut -f7' \; >> accessions.txt
rm -rf ncbi-genomes-2018-10-25/GCF*
wc -l accessions.txt
```
285952 accessions
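The `find`/`grep`/`cut` extraction above can also be sketched in Python, which may be easier to reuse from other scripts. This is an illustrative equivalent, not the code used in the repository: the function names are hypothetical, and it assumes the reports are tab-separated with the accession in the seventh column, as implied by `cut -f7`.

```python
from pathlib import Path


def accessions_from_report(path, column=7):
    """Yield one 1-based tab-separated column from the non-comment lines."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # same effect as: grep -v "^#"
            fields = line.rstrip("\n").split("\t")
            # Unlike `cut`, lines with too few fields are skipped entirely.
            if len(fields) >= column:
                yield fields[column - 1]


def collect_accessions(report_dir, pattern="GCF_*"):
    """Gather accessions from every file under report_dir matching pattern."""
    accessions = []
    for report in sorted(Path(report_dir).rglob(pattern)):
        if report.is_file():
            accessions.extend(accessions_from_report(report))
    return accessions


if __name__ == "__main__":
    found = collect_accessions("ncbi-genomes-2018-10-25")
    # Opening with "w" overwrites the file, whereas the shell version appends.
    Path("accessions.txt").write_text("".join(f"{acc}\n" for acc in found))
    print(len(found), "accessions")
```

Overwriting rather than appending means a re-run does not silently double the count reported by `wc -l`.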
# Library
......@@ -20,7 +66,7 @@ Manually download the summary returned from NCBI.
### library
Downloaded on 17/10/2018.
Contains all _raw_ fasta files, per domain. Do not change this.
Contains all __raw__ fasta files, per domain. Do not change this.
Domains included:
- archaea
- bacteria
......@@ -31,4 +77,23 @@ Domains included:
- vertebrate_mammalian
- viral
## test_data
1. EAV_simulation
- 1M 150 bp PE reads, generated from the Equine Arteritis Virus reference genome.
- Reference genome used: `EAV_simulation/reference/NC_002532.2.fa`.
2. MM_samples
- Respiratory samples provided by MM.
- NextSeq PE 150bp.
3. public
* [McIntyre et al. 2017](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1299-7)
- Simulated and biological communities
- Description: [Table S2](https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-017-1299-7/MediaObjects/13059_2017_1299_MOESM2_ESM.xlsx)
- Data: Available [here](https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/)