Release Legacy Steps
These steps may be useful for reference.
Generate annotated SEED files
Export SEED files from SVN using generate_ftp_files.py:
to export all files (recommended 16GB RAM to checkout large families):
python generate_ftp_files.py --acc all --seed --dest-dir /path/to/seed/files/dest/dir
to export accessions listed in a file
python generate_ftp_files.py -f /path/to/rfam_accession_list.txt --seed --dest-dir /path/to/seed/files/dest/dir
Alternatively, use writeAnnotatedSeed.pl:
perl writeAnnotatedSeed.pl RFXXXXX
Generate annotated CM files
Export plain CM files using generate_ftp_files.py:
to export all files (recommended 16GB RAM to checkout large families)
python generate_ftp_files.py --acc all --cm --dest-dir /path/to/seed/files/dest/dir
to export accessions listed in a file python
generate_ftp_files.py -f /path/to/rfam_accession_list.txt --cm --dest-dir /path/to/CM/files/dest/dir
alternatively use writeAnnotatedCM.pl:
perl writeAnnotatedCM.pl RFXXXXX
Rewrite CM file and descriptions from SEED using seed-desc-to-cm.pl:
Required files:
$CM_no_desc - a CM file to add DESC to (could be 1 CM or all CMs in a single file)
$SEED_with_DESC - a seed file with DESC lines (could be 1 seed or all seeds in a single file)
filter out DESC lines to avoid duplicates as some CMs already have DESC lines
grep -v DESC $CM_no_desc > Rfam.nodesc.cm
perl seed-desc-to-cm.pl $SEED_with_DESC Rfam.nodesc.cm > Rfam.cm
check that Rfam.cm contains the correct number of families
cmstat Rfam.cm | grep -v '#' | wc -l
check the number of DESC lines - should be 2 * number of families
grep DESC Rfam.cm | wc -l
generate the final archive
gzip -c Rfam.cm > Rfam.cm.gz
split annotated Rfam.cm into individual .cm files and create Rfam.tar.gz:
delete any existing files
rm -f RF0*.cm
get a list of Rfam IDs found in Rfam.cm
grep ACC Rfam.cm | sed -e 's/ACC\s+//g' | sort | uniq > list.txt
create an index file to speed up cm retrieval
cmfetch --index Rfam.cm
create individual cm files
while read p; do echo $p; cmfetch Rfam.cm $p > "$p.cm" ; done <list.txt
create an archive
tar -czvf Rfam.tar.gz RF0*.cm
remove temporary files
rm RF0*.cm
The CM file is used for PDB mapping. The SEED and CM files are stored in FTP archive and are also available from the Rfam website.
Load annotated SEED and CM files into rfam_live
This enables the SEED and CM download directly from the Rfam website. Requires new Rfam.cm and Rfam.seed files.
Use load_cm_seed_in_db.py:
python load_cm_seed_in_db.py /path/to/Rfam.seed /path/to/Rfam.cm
Alternatively, use populateAnnotatedFiles.pl to process families one by one:
perl populateAnnotatedFiles.pl RFXXXXX /path/to/RFXXXXX.cm /path/to/RFXXXXX.seed
Clan competition
Clan competition is a quality assurance measure ran as a pre-processing release step aiming to reduce redundant hits of families belonging to the same clan.
Create a destination directory for clan competition required files
mkdir ~/releaseX/clan_competition mkdir ~/releaseX/clan_competition/sorted
Generate clan files using clan_file_generator.py:
recommended 16GB RAM
python clan_file_generator.py --dest-dir ~/releaseX/clan_competition --clan-acc CL00001 --cc-type FULL
–cc-type: Clan competition type FULL/PDB
–clan-acc: Clan accession to compete
–all: Compete all Rfam clans
-f: Only compete clans in file
Sort clan files in dest-dir based on rfamseq_acc (col2) using linux sort command and store then in the sorted directory:
sort -k2 -t $'\t' clan_file.txt > ~/releaseX/clan_competition/sorted/clan_file_sorted.txt
or for multiple files cd ~/releaseX/clan_competition and run:
for file in ./CL*; do sort -k2 -t $'\t' ${file:2:7}.txt > sorted/${file:2:7}_s.txt; done
Run clan competition using clan_competition.py:
python clan_competition.py --input ~/releaseX/clan_competition/sorted --full
–input: Path to ~/releaseX/clan_competition/sorted
–pdb: Type of hits to compete PDB (pdb_full_region table)
–full: Type of hits to compete FULL (full_region table)
Prepare rfam_live for a new release
Populate rfam_live tables using populate_rfamlive_for_release.py:
python populate_rfamlive_for_release.py –all
Make keywords using make_rfam_keywords_table.pl:
perl make_rfam_keywords_table.pl
Update taxonomy_websearch table using updateTaxonomyWebsearch.pl (:warning: requires cluster access):
perl updateTaxonomyWebsearch.pl
Running view processes
:warning: Requires cluster access and updating the PDB fasta file. Note that using PDB view processes plugin will be discontinued in favour of regular updates of the entire pdb_full_region table.
Create a list of tab separated family accessions and their corresponding uuids using the following query:
select rfam_acc, uuid from _post_process where status='PEND';
Update the PDB sequence file and the path in the PDB plugin
Launch view processes on the EBI cluster:
python job_dequeuer.py --view-list /path/to/rfam_uuid_pairs.tsv --dest-dir /path/to/destination/directory
–view-list: A list with tab separated rfam_acc, uuid pairs to run view processes on –dest-dir: The path to the destination directory to generate shell scripts and log to
Update PDB mapping
This step requires a finalised Rfam.cm file with the latest families, including descriptions (see FTP section for instructions).
1. Get PDB sequences in FASTA format
wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
gunzip pdb_seqres.txt.gz
2. Split into individual sequences
perl -pe "s/\>/\/\/\n\>/g" < pdb_seqres.txt > pdb_seqres_sep.txt
3. Remove protein sequences using collateSeq.pl:
perl collateSeq.pl pdb_seqres_sep.txt
mv pdb_seqres_sep.txt.noprot pdb_seqres_sep_noprot.fa
4. Remove extra text from FASTA descriptor line
awk '{print $1}' pdb_seqres_sep_noprot.fa > pdb_trimmed_noprot.fa
5. Replace any characters that are not recognised by Infernal
sed -e '/^[^>]/s/[^ATGCURYMKSWHBVDatgcurymkswhbvd]/N/g' pdb_trimmed_noprot.fa > pdb_trimmed_noillegals.fa
6. Split into 100 files
mkdir files
seqkit split -O files -p 100 pdb_trimmed_noillegals.fa
7. Run cmscan in parallel
cd files
cmpress Rfam.cm
bsub -o part_001.lsf.out -e part_001.lsf.err -M 16000 "cmscan -o part_001.output --tblout part_001.tbl --cut_ga Rfam.cm pdb_trimmed_noillegals.part_001.fa"
...
bsub -o part_100.lsf.out -e part_100.lsf.err -M 16000 "cmscan -o part_100.output --tblout part_100.tbl --cut_ga Rfam.cm pdb_trimmed_noillegals.part_001.fa"
To check that all commands completed, check that every log file contains Successfully completed:
cat part_*lsf.out | grep Success | wc -l
8. Combine results
cat *.tbl | sort | grep -v '#' > PDB_RFAM_X_Y.tbl
9. Convert Infernal output to pdb_full_region table using infernal_2_pdb_full_region.py:
python infernal_2_pdb_full_region.py --tblout /path/to/pdb_full_region.tbl --dest-dir /path/to/dest/dir
The script will generate a file like pdb_full_region_YYYY-MM-DD.txt.
10. Create a new temporary table in rfam_live
create table pdb_full_region_temp like pdb_full_region;
Manually import the data in the .txt dump into rfam_live using Sequel Ace or similar
:warning: mysqlimport or LOAD DATA INFILE do not work with rfam_live because of secure-file-priv MySQL setting.
Examine the newly imported data and compare with pdb_full_region
Rename pdb_full_region to pdb_full_region_old and rename pdb_full_region_temp to pdb_full_region
Clan compete the hits as described under Clan competition section using the –PDB option
List new families with 3D structures
number of families with 3D before select count(distinct rfam_acc) from pdb_full_region_old where is_significant = 1;
number of families with 3D after select count(distinct rfam_acc) from pdb_full_region where is_significant = 1;
new families with 3D select distinct rfam_acc from pdb_full_region where is_significant = 1 and rfam_acc not in (select distinct rfam_acc from pdb_full_region_temp where is_significant = 1);
Stage RfamLive for a new release
Create MySQL dumps using mysqldump:
Create a new MySQL dump to replicate the database on REL and PUBLIC servers:
export MYSQL_PWD=rfam_live_password
bsub -o mysqldump.out -e mysqldump.err "mysqldump -u <user> -h <hostname> -P <port> --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G rfam_live > rfam_live_relX.sql"
To create a MySQL dump of a single table:
mysqldump -u username -h hostname -P port -p --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G database_name table_name > table_name.sql
Restore a MySQL database instance on a remote server
2. Move to the directory where a new MySQL dump is located
cd /path/to/rfam_live_relX.sql
3. Connect to the MySQL server (e.g. PUBLIC)
mysql -u username -h hostname -P port -p
4. Create a new MySQL Schema
Create schema rfam_X_Y;
5. Select schema to restore MySQL dump
Use rfam_X_Y;
6. Restore the database
source rfam_live_relX.sql
Generate FTP files
Generate annotated tree files
Generate new tree files for the release using generate_ftp_files.py:
to export all files (recommended 16GB RAM to checkout large families) python generate_ftp_files.py –acc all –tree –dest-dir /path/to/tree/files/dest/dir
to export accessions listed in a file python generate_ftp_files.py -f /path/to/rfam_accession_list.txt –tree –dest-dir /path/to/tree/files/dest/dir
alternatively use writeAnnotatedTree.pl:
perl writeAnnotatedTree.pl RFXXXXX
Generate rfam2go files
Create a new rfam2go export by running rfam2go.pl:
rfam2go.pl > /path/to/dest/dir/rfam2go
Add a header line to rfam2go with release and date.
!version date: <YYYY/MM/DD>
!description: A mapping of GO terms to Rfam release <XX.X>.
!external resource: https://rfam.org/
!citation: Kalvari et al. (2020) Nucl. Acids Res. 49: D192-D200
!contact: rfam-help@ebi.ac.uk
Create md5 checksum of rfam2go file:
cd rfam2go && md5sum * > md5.txt
Run GO validation and review warnings using validate_rfam2go.pl:
perl validate_rfam2go.pl goselect selectgo rfam2go
Generate Rfam.full_region file
Export ftp_Rfam_full_region.sql:
export MYSQL_PWD=rfam_live_password
mysql -u
gzip Rfam.full_region
Generate Rfam.pdb file
Export ftp_rfam_pdb.sql:
export MYSQL_PWD=rfam_live_password
mysql -u <user> -h <host> -P <port> --database rfam_live < sql/ftp_rfam_pdb.sql > /path/to/Rfam.pdb
gzip Rfam.pdb
Generate Rfam.clanin file
Generate a new Rfam.clanin file using clanin_file_generator.py:
python clanin_file_generator.py /path/to/destination/directory
Generate database_files folder
The option –tab of mysqldump is used to generate dumps in both .sql and .txt formats, where:
table.sql includes the MySQL query executed to create a table
table.txt includes the table contents in tabular format
The main rfam_live database is running with the secure-file-priv setting so it is not possible to export using the –tab option directly. A workaround is to copy dump files on a laptop, and use a local MySQL database for export.
Create a new database dump using mysqldump:
mysqldump -u <user> -h <host> -P <port> -p --single-transaction --add-locks --lock-tables --add-drop-table --dump-date --comments --allow-keywords --max-allowed-packet=1G --tab=/tmp/database_files rfam_local
Run the database_file_selector.py python script to create the subset of database_files available on the FTP:
cd /releaseX/database_files
gzip *.txt
python database_file_selector.py --source-dir /tmp --dest-dir /releaseX/database_files
Zip database_files directory and copy to remote server:
tar -czvf database_files.tar.gz /releaseX/database_files scp database_files.tar.gz username@remote.host:/some/location
Restore database_files on FTP
tar -xzvf /some/location/database_files.tar.gz .
Generate fasta_files folder
Export new fasta files for all families in Rfam using fasta_file_generator.py:
bsub -M 10000 -o fasta_generator.lsf.out -e fasta_generator.lsf.err "python fasta_file_generator.py --seq-db /path/to/rfamseq.fa --rfam-seed /path/to/releaseX/Rfam.seed --all --outdir /path/to/ftp/fasta_files"
Update Rfam text search
Description of indexed fields: https://www.ebi.ac.uk/ebisearch/metadata.ebi?db=rfam
Generate new data dumps
Requirements:
The directory for the text search index must have the following structure:
release_note.txt
families
clans
genomes
motifs
full_region
Move to the main rfam-production repo and setup the django environment:
source django_settings.sh
Create an output directory for a new release index and all required subdirectories:
cd /path/to/relX_text_search
mkdir clans
mkdir families
mkdir full_region
mkdir genomes
mkdir motifs
Create new XML dumps:
cd /path/to/relX_text_search
python rfam_xml_dumper.py --type F --out families
python rfam_xml_dumper.py --type C --out clans
python rfam_xml_dumper.py --type M --out motifs
python rfam_xml_dumper.py --type G --out genomes
python rfam_xml_dumper.py --type R --out full_region
For more information on launching XML dumps as LSF jobs see lsf_rfam_xml_dumper.sh.
Validate xml dumps using xml_validator.py:
python xml_validator.py --input /path/to/relX_text_search/families --log
python xml_validator.py --input /path/to/relX_text_search/clans --log
python xml_validator.py --input /path/to/relX_text_search/motifs --log
python xml_validator.py --input /path/to/relX_text_search/genomes --log
python xml_validator.py --input /path/to/relX_text_search/full_region --log
Check that error.log files in each folder is empty.
Get the total number of entries:
grep -r 'entry id' . | wc -l
Create release_note.txt file:
release=14.5 release_date=15-Mar-2021 entries=2949678
Index data on dev and prod
Change directory to text_search on the cluster:
cd_main && cd search_dumps
Index data on dev:
unlink rfam_dev
ln -s /path/to/xml/data/dumps rfam_dev
Index data on prod:
unlink current_release
ln -s /path/to/xml/data/dumps current_release
The files in rfam_dev and current_release folders are automatically indexed every night.
Create new json dumps for RNAcentral
Create a new region export from Rfam using Rfam2RNAcentral.pl:
Extract SEED regions
perl Rfam2RNAcentral.pl SEED > /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt
Extract FULL region
perl Rfam2RNAcentral.pl FULL >> /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt
Split regions into smaller chunks using basic linux split comman
mkdir chunks && cd chunks && split -n 3000 /path/to/relX/rnacentral/dir/rfamX_rnac_regions.txt rnac_ –additional-suffix=’.txt’
-n: defines the number of chunks to generate (2000 limit on LSF)
Notes:
This command will generate 3000 files named like rnac_zbss.txt
Use -l 500 option for more efficient chink size
Create a copy of the fasta files directory
mkdir fasta_files && cd fasta_files && wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/* .
NOTE: Rfam2RNAcentral regions need to match the exact same release the fasta_files were created for.
Use the Public MySQL database if rfam-live have been updated
Unzip all fasta files and index using esl-sfetch:
gunzip * .gz for file in ./RF*.fa; do esl-sfetch –index $file; done
Copy the Rfam.seed file from the FTP to the fasta_files directory and index using esl-sfetch`:
wget http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.seed.gz && gunzip Rfam.seed.gz esl-sfetch –index Rfam.seed
Launch a new json dump using rnac2json.py:
fasta file directory: Create a new fasta files directory
json files directory: Create a new json files output directory
python rnac2json.py –input /path/to/chunks –rfam-fasta /path/to/fasta_files –outdir /path/to/json_files