Release Legacy Steps

These steps may be useful for reference.

Generate annotated SEED files

Export SEED files from SVN using generate_ftp_files.py:

  • to export all files (recommended 16GB RAM to check out large families):

python generate_ftp_files.py --acc all --seed --dest-dir /path/to/seed/files/dest/dir
  • to export accessions listed in a file

python generate_ftp_files.py -f /path/to/rfam_accession_list.txt --seed --dest-dir /path/to/seed/files/dest/dir
  • Alternatively, use writeAnnotatedSeed.pl:

perl writeAnnotatedSeed.pl RFXXXXX

Generate annotated CM files

Export plain CM files using generate_ftp_files.py:

  • to export all files (recommended 16GB RAM to check out large families):

python generate_ftp_files.py --acc all --cm --dest-dir /path/to/CM/files/dest/dir
  • to export accessions listed in a file:

python generate_ftp_files.py -f /path/to/rfam_accession_list.txt --cm --dest-dir /path/to/CM/files/dest/dir
  • Alternatively, use writeAnnotatedCM.pl:

perl writeAnnotatedCM.pl RFXXXXX 
  • Rewrite CM file and descriptions from SEED using seed-desc-to-cm.pl:

Required files:

  • $CM_no_desc - a CM file to add DESC to (could be 1 CM or all CMs in a single file)

  • $SEED_with_DESC - a seed file with DESC lines (could be 1 seed or all seeds in a single file)

filter out DESC lines to avoid duplicates as some CMs already have DESC lines
grep -v DESC $CM_no_desc > Rfam.nodesc.cm
perl seed-desc-to-cm.pl $SEED_with_DESC Rfam.nodesc.cm > Rfam.cm

check that Rfam.cm contains the correct number of families
cmstat Rfam.cm | grep -v '#' | wc -l

check the number of DESC lines - should be 2 * number of families
grep DESC Rfam.cm | wc -l
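
To automate both checks, a minimal bash sketch (variable names are illustrative):

FAMILIES=$(cmstat Rfam.cm | grep -v '#' | wc -l)
DESC_LINES=$(grep -c DESC Rfam.cm)
# every family should contribute two DESC lines: one in the CM, one in its filter HMM
if [ "$DESC_LINES" -eq $((2 * FAMILIES)) ]; then
    echo "OK: $FAMILIES families, $DESC_LINES DESC lines"
else
    echo "MISMATCH: $FAMILIES families, $DESC_LINES DESC lines"
fi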

generate the final archive
gzip -c Rfam.cm > Rfam.cm.gz 

Split annotated Rfam.cm into individual .cm files and create Rfam.tar.gz:

delete any existing files
rm -f RF0*.cm

get a list of Rfam accessions found in Rfam.cm
grep ACC Rfam.cm | sed -E 's/ACC\s+//' | sort | uniq > list.txt

create an index file to speed up cm retrieval
cmfetch --index Rfam.cm

create individual cm files
while read p; do echo $p; cmfetch Rfam.cm $p > "$p.cm" ; done <list.txt

create an archive
tar -czvf Rfam.tar.gz RF0*.cm

remove temporary files
rm RF0*.cm 
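
As an optional sanity check (a sketch, not part of the original procedure), the archive should contain one .cm file per accession in list.txt:

tar -tzf Rfam.tar.gz | wc -l
wc -l < list.txt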

The CM file is used for PDB mapping. The SEED and CM files are stored in the FTP archive and are also available from the Rfam website.

Load annotated SEED and CM files into rfam_live

This enables SEED and CM downloads directly from the Rfam website. It requires the new Rfam.cm and Rfam.seed files.

Use load_cm_seed_in_db.py:

python load_cm_seed_in_db.py /path/to/Rfam.seed /path/to/Rfam.cm

Alternatively, use populateAnnotatedFiles.pl to process families one by one:

perl populateAnnotatedFiles.pl RFXXXXX /path/to/RFXXXXX.cm /path/to/RFXXXXX.seed

Clan competition

Clan competition is a quality assurance measure run as a pre-release processing step that aims to reduce redundant hits from families belonging to the same clan.

  1. Create a destination directory for clan competition required files

mkdir ~/releaseX/clan_competition
mkdir ~/releaseX/clan_competition/sorted
  2. Generate clan files using clan_file_generator.py:

  • recommended 16GB RAM

python clan_file_generator.py --dest-dir ~/releaseX/clan_competition --clan-acc CL00001 --cc-type FULL

--cc-type: clan competition type (FULL/PDB)

--clan-acc: clan accession to compete

--all: compete all Rfam clans

-f: only compete clans listed in a file
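
For example, to generate files for all clans at once (a hedged example combining the options above):

python clan_file_generator.py --dest-dir ~/releaseX/clan_competition --all --cc-type FULL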

  3. Sort the clan files in dest-dir on rfamseq_acc (column 2) using the Linux sort command and store them in the sorted directory:

sort -k2 -t $'\t' clan_file.txt > ~/releaseX/clan_competition/sorted/clan_file_sorted.txt

or, for multiple files, cd to ~/releaseX/clan_competition and run:

for file in CL*.txt; do sort -k2 -t $'\t' "$file" > "sorted/${file%.txt}_s.txt"; done
  4. Run clan competition using clan_competition.py:

python clan_competition.py --input ~/releaseX/clan_competition/sorted --full

--input: path to the sorted clan files, i.e. ~/releaseX/clan_competition/sorted

--pdb: compete PDB hits (pdb_full_region table)

--full: compete FULL hits (full_region table)
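
To compete PDB hits instead (using clan files generated with --cc-type PDB), the corresponding call is:

python clan_competition.py --input ~/releaseX/clan_competition/sorted --pdb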

Prepare rfam_live for a new release

  1. Populate rfam_live tables using populate_rfamlive_for_release.py:

python populate_rfamlive_for_release.py --all

  2. Make keywords using make_rfam_keywords_table.pl:

perl make_rfam_keywords_table.pl

  3. Update the taxonomy_websearch table using updateTaxonomyWebsearch.pl (:warning: requires cluster access):

perl updateTaxonomyWebsearch.pl

Running view processes

:warning: Requires cluster access and an updated PDB fasta file. Note that the PDB view process plugin will be discontinued in favour of regular updates of the entire pdb_full_region table.

  1. Create a list of tab-separated family accessions and their corresponding uuids using the following query:

select rfam_acc, uuid from _post_process where status='PEND';  
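To run the query non-interactively and produce the tab-separated pairs file expected below, a sketch using the mysql client (connection placeholders as elsewhere in this document):

mysql -u <user> -h <hostname> -P <port> -p --batch --skip-column-names \
    -e "select rfam_acc, uuid from _post_process where status='PEND'" rfam_live > rfam_uuid_pairs.tsv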
  2. Update the PDB sequence file and the path in the PDB plugin

  3. Launch view processes on the EBI cluster:


python job_dequeuer.py --view-list /path/to/rfam_uuid_pairs.tsv --dest-dir /path/to/destination/directory 

--view-list: a list of tab-separated rfam_acc, uuid pairs to run view processes on

--dest-dir: the destination directory in which to generate shell scripts and logs

Update PDB mapping

This step requires a finalised Rfam.cm file with the latest families, including descriptions (see FTP section for instructions).

1. Get PDB sequences in FASTA format

wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz     

gunzip pdb_seqres.txt.gz    

2. Split into individual sequences

perl -pe "s/\>/\/\/\n\>/g" < pdb_seqres.txt > pdb_seqres_sep.txt    

3. Remove protein sequences using collateSeq.pl:

perl collateSeq.pl pdb_seqres_sep.txt     

mv pdb_seqres_sep.txt.noprot pdb_seqres_sep_noprot.fa    

4. Remove extra text from FASTA descriptor line

awk '{print $1}' pdb_seqres_sep_noprot.fa > pdb_trimmed_noprot.fa    

5. Replace any characters that are not recognised by Infernal

sed -e '/^[^>]/s/[^ATGCURYMKSWHBVDatgcurymkswhbvd]/N/g' pdb_trimmed_noprot.fa > pdb_trimmed_noillegals.fa    

6. Split into 100 files

mkdir files

seqkit split -O files -p 100 pdb_trimmed_noillegals.fa    

7. Run cmscan in parallel

cd files     
cmpress Rfam.cm     
bsub -o part_001.lsf.out -e part_001.lsf.err -M 16000 "cmscan -o part_001.output --tblout part_001.tbl --cut_ga Rfam.cm pdb_trimmed_noillegals.part_001.fa"     
...     
bsub -o part_100.lsf.out -e part_100.lsf.err -M 16000 "cmscan -o part_100.output --tblout part_100.tbl --cut_ga Rfam.cm pdb_trimmed_noillegals.part_100.fa"
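
Rather than typing 100 near-identical commands, a bash loop sketch (assumes the part_NNN naming produced by seqkit above):

for i in $(seq -w 1 100); do
    bsub -o part_$i.lsf.out -e part_$i.lsf.err -M 16000 \
        "cmscan -o part_$i.output --tblout part_$i.tbl --cut_ga Rfam.cm pdb_trimmed_noillegals.part_$i.fa"
done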

To check that all commands completed, confirm that every log file contains Successfully completed (the count below should equal the number of chunks, i.e. 100):

cat part_*lsf.out | grep Success | wc -l    

8. Combine results

cat *.tbl | sort | grep -v '#' > PDB_RFAM_X_Y.tbl    

9. Convert Infernal output to pdb_full_region table using infernal_2_pdb_full_region.py:

python infernal_2_pdb_full_region.py --tblout /path/to/pdb_full_region.tbl --dest-dir /path/to/dest/dir    

The script will generate a file like pdb_full_region_YYYY-MM-DD.txt.

10. Create a new temporary table in rfam_live

create table pdb_full_region_temp like pdb_full_region; 
11. Manually import the data from the .txt dump into rfam_live using Sequel Ace or similar

:warning: mysqlimport or LOAD DATA INFILE do not work with rfam_live because of the secure-file-priv MySQL setting.

12. Examine the newly imported data and compare with pdb_full_region

13. Rename pdb_full_region to pdb_full_region_old and rename pdb_full_region_temp to pdb_full_region

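Both renames can be performed in one atomic statement; a sketch in standard MySQL:

rename table pdb_full_region to pdb_full_region_old,
             pdb_full_region_temp to pdb_full_region;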

14. Clan compete the hits as described in the Clan competition section, using the --pdb option

15. List new families with 3D structures

  • number of families with 3D structures before:

select count(distinct rfam_acc) from pdb_full_region_old where is_significant = 1;

  • number of families with 3D structures after:

select count(distinct rfam_acc) from pdb_full_region where is_significant = 1;

  • new families with 3D structures:

select distinct rfam_acc from pdb_full_region where is_significant = 1 and rfam_acc not in (select distinct rfam_acc from pdb_full_region_old where is_significant = 1);

Stage RfamLive for a new release

Create MySQL dumps using mysqldump:

  1. Create a new MySQL dump to replicate the database on REL and PUBLIC servers:

export MYSQL_PWD=rfam_live_password

bsub -o mysqldump.out -e mysqldump.err "mysqldump -u <user> -h <hostname> -P <port> --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G rfam_live > rfam_live_relX.sql"

To create a MySQL dump of a single table:

mysqldump -u username  -h hostname -P port -p --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G database_name table_name > table_name.sql

Restore a MySQL database instance on a remote server

2. Move to the directory where a new MySQL dump is located

    cd /path/to

3. Connect to the MySQL server (e.g. PUBLIC)

    mysql -u username  -h hostname -P port -p    

4. Create a new MySQL Schema

    create schema rfam_X_Y;

5. Select schema to restore MySQL dump

    use rfam_X_Y;

6. Restore the database

    source rfam_live_relX.sql  
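
A quick post-restore sanity check (a sketch; assumes the standard Rfam family table is present):

show tables;
select count(*) from family;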

Generate FTP files

  1. Generate annotated tree files

Generate new tree files for the release using generate_ftp_files.py:

  • to export all files (recommended 16GB RAM to check out large families):

python generate_ftp_files.py --acc all --tree --dest-dir /path/to/tree/files/dest/dir

  • to export accessions listed in a file:

python generate_ftp_files.py -f /path/to/rfam_accession_list.txt --tree --dest-dir /path/to/tree/files/dest/dir

  • Alternatively, use writeAnnotatedTree.pl:

perl writeAnnotatedTree.pl RFXXXXX

  2. Generate rfam2go files

Create a new rfam2go export by running rfam2go.pl:

perl rfam2go.pl > /path/to/dest/dir/rfam2go

Add a header to rfam2go with the release version and date:

!version date: <YYYY/MM/DD>

!description: A mapping of GO terms to Rfam release <XX.X>.

!external resource: https://rfam.org/

!citation: Kalvari et al. (2020) Nucl. Acids Res. 49: D192-D200

!contact: rfam-help@ebi.ac.uk

Create md5 checksum of rfam2go file:

cd rfam2go && md5sum * > md5.txt

Run GO validation and review warnings using validate_rfam2go.pl:

perl validate_rfam2go.pl goselect selectgo rfam2go

Generate Rfam.full_region file

Export Rfam.full_region using ftp_rfam_full_region.sql:

export MYSQL_PWD=rfam_live_password

mysql -u <user> -h <host> -P <port> --database rfam_live < sql/ftp_rfam_full_region.sql > /path/to/Rfam.full_region

gzip Rfam.full_region

Generate Rfam.pdb file

Export Rfam.pdb using ftp_rfam_pdb.sql:

export MYSQL_PWD=rfam_live_password

mysql -u <user> -h <host> -P <port> --database rfam_live < sql/ftp_rfam_pdb.sql > /path/to/Rfam.pdb

gzip Rfam.pdb

Generate Rfam.clanin file

Generate a new Rfam.clanin file using clanin_file_generator.py:

python clanin_file_generator.py /path/to/destination/directory
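
For reference, Rfam.clanin lists one clan per line: the clan accession followed by its member family accessions, whitespace-separated (illustrative excerpt):

CL00001 RF00005 RF01852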

Generate database_files folder

The --tab option of mysqldump generates dumps in both .sql and .txt formats, where:

  • table.sql contains the MySQL statement used to create the table

  • table.txt contains the table contents in tab-delimited format

The main rfam_live database runs with the secure-file-priv setting, so it is not possible to export using the --tab option directly. A workaround is to copy the dump files to a laptop and use a local MySQL database for the export.

Create a new database dump using mysqldump:

    mysqldump -u <user> -h <host> -P <port> -p --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G --tab=/tmp/database_files rfam_local  

Run the database_file_selector.py script to create the subset of database_files available on the FTP, then compress the table dumps:

    python database_file_selector.py --source-dir /tmp/database_files --dest-dir /releaseX/database_files
    cd /releaseX/database_files
    gzip *.txt

Archive the database_files directory and copy it to the remote server:

tar -czvf database_files.tar.gz /releaseX/database_files
scp database_files.tar.gz username@remote.host:/some/location

Restore database_files on FTP

tar -xzvf /some/location/database_files.tar.gz

Generate fasta_files folder

Export new fasta files for all families in Rfam using fasta_file_generator.py:

bsub -M 10000 -o fasta_generator.lsf.out -e fasta_generator.lsf.err "python fasta_file_generator.py --seq-db /path/to/rfamseq.fa --rfam-seed /path/to/releaseX/Rfam.seed --all --outdir /path/to/ftp/fasta_files"

Generate new data dumps

Requirements:

The directory for the text search index must have the following structure:

release_note.txt
families/
clans/
genomes/
motifs/
full_region/

Move to the main rfam-production repo and set up the Django environment:

source django_settings.sh    

Create an output directory for a new release index and all required subdirectories:

cd /path/to/relX_text_search
mkdir clans families full_region genomes motifs

Create new XML dumps:

cd /path/to/relX_text_search     

python rfam_xml_dumper.py --type F --out families     

python rfam_xml_dumper.py --type C --out clans     

python rfam_xml_dumper.py --type M --out motifs     

python rfam_xml_dumper.py --type G --out genomes     

python rfam_xml_dumper.py --type R --out full_region   

For more information on launching XML dumps as LSF jobs see lsf_rfam_xml_dumper.sh.

Validate xml dumps using xml_validator.py:

python xml_validator.py --input /path/to/relX_text_search/families --log     

python xml_validator.py --input /path/to/relX_text_search/clans --log     

python xml_validator.py --input /path/to/relX_text_search/motifs --log     

python xml_validator.py --input /path/to/relX_text_search/genomes --log     

python xml_validator.py --input /path/to/relX_text_search/full_region --log   

Check that the error.log file in each folder is empty.

Get the total number of entries:

grep -r 'entry id' . | wc -l    

Create release_note.txt file:

release=14.5
release_date=15-Mar-2021
entries=2949678

Index data on dev and prod

Change to the search_dumps directory on the cluster:

cd_main && cd search_dumps

Index data on dev:

unlink rfam_dev

ln -s /path/to/xml/data/dumps rfam_dev

Index data on prod:

unlink current_release

ln -s /path/to/xml/data/dumps current_release

The files in rfam_dev and current_release folders are automatically indexed every night.

Create new json dumps for RNAcentral

Create a new region export from Rfam using Rfam2RNAcentral.pl:

Extract SEED regions

perl Rfam2RNAcentral.pl SEED > /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt    

Extract FULL region

perl Rfam2RNAcentral.pl FULL >> /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt    

Split the regions into smaller chunks using the Linux split command:

mkdir chunks && cd chunks && split -n 3000 /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt rnac_ --additional-suffix='.txt'

-n: defines the number of chunks to generate (2000 limit on LSF)

Notes:

  • This command will generate 3000 files named like rnac_zbss.txt

  • Use the -l 500 option for a more efficient chunk size, as shown below
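
For example, a line-based split (a sketch using the same prefix and suffix conventions as above):

split -l 500 /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt rnac_ --additional-suffix='.txt'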

Create a copy of the fasta files directory

mkdir fasta_files && cd fasta_files
wget -r -np -nd http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/

NOTE: The Rfam2RNAcentral regions need to match the exact release the fasta_files were created for. Use the public MySQL database if rfam_live has been updated.

Unzip all fasta files and index using esl-sfetch:

gunzip *.gz
for file in ./RF*.fa; do esl-sfetch --index $file; done

Copy the Rfam.seed file from the FTP to the fasta_files directory and index it using esl-sfetch:

wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.seed.gz && gunzip Rfam.seed.gz && esl-sfetch --index Rfam.seed

Launch a new json dump using rnac2json.py:

  • fasta files directory: the indexed fasta_files directory created above

  • json files directory: create a new output directory for the json files

python rnac2json.py --input /path/to/chunks --rfam-fasta /path/to/fasta_files --outdir /path/to/json_files