
util.fasta
Functions for Reading FASTA Files and Downloading from UniProt
Search the header lines of a FASTA file, read protein sequences from a file, count numbers of amino acids in each sequence, and download sequences from UniProt.
Usage
Arguments
- file
character, path to FASTA file
- iseq
numeric, which sequences to read from the file
- ret
character, specification for type of return (count, sequence, or FASTA format)
- lines
list of character, supply the lines here instead of reading them from file
- ihead
numeric, which lines are headers
- start
numeric, position in sequence to start counting
- stop
numeric, position in sequence to stop counting
- type
character, sequence type (protein or DNA)
- id
character, value to be used for the protein names in the output table
- seq
character, amino acid sequence of a protein
- protein
character, entry name for protein in UniProt
Details
read.fasta is used to retrieve entries from a FASTA file. Use iseq to select the sequences to read (the default is all sequences). The format of the return value depends on ret. The default returns a data frame of amino acid counts (this data frame can be used to add the proteins to the package's table of protein compositions); the other two settings return a list of sequences or a list of lines extracted from the FASTA file, including the headers (the latter can be used, for example, to generate a new FASTA file containing only the selected sequences). If the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to ""). When ret requests counts, the names of the proteins in the resulting data frame are parsed from the header lines of the file, unless id is provided. If id is not given and a UniProt FASTA header is detected, the information there (accession, name, organism) is split into three columns of the resulting data frame, the last of which is organism.
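The splitting of a FASTA file into header lines and sequences, with an optional selection of records, can be sketched in base R as follows. This is only an illustration under stated assumptions: parse_fasta_lines is a hypothetical helper and not one of the documented functions, the input lines are made-up toy data, and the real function may handle edge cases differently.

```r
# Hypothetical helper (not part of the documented API): split FASTA lines
# into headers and sequences using base R only.
parse_fasta_lines <- function(lines, iseq = NULL) {
  # Header lines start with ">"
  ihead <- grep("^>", lines)
  if (length(ihead) == 0) stop("no FASTA header lines found")
  # Each record runs from the line after its header
  # to the line before the next header (or the end of the input)
  first <- ihead + 1
  last <- c(ihead[-1] - 1, length(lines))
  # By default, read all records
  if (is.null(iseq)) iseq <- seq_along(ihead)
  sequences <- vapply(iseq, function(i) {
    if (first[i] > last[i]) return("")
    paste(lines[first[i]:last[i]], collapse = "")
  }, character(1))
  list(header = lines[ihead[iseq]], sequence = sequences)
}

# Toy input: two records, the first spread over two lines
lines <- c(">seq1 first toy record", "MKV", "LAA",
           ">seq2 second toy record", "GGSC")
records <- parse_fasta_lines(lines, iseq = 2)
print(records$sequence)
```

Locating the header lines once and reusing their positions is what makes it cheap to pull several subsets of records from the same set of lines.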
count.aa counts the occurrences of each amino acid or nucleic-acid base in a sequence (seq). For amino acids, the columns in the returned data frame are in the same order as in the package's table of protein compositions. The matching of letters is case-insensitive. A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations. start and/or stop can be provided to count a fragment of the sequence. If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
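The counting rules above (case-insensitive matching, spaces ignored, a warning for unrecognized letters, and optional fragment bounds) can be sketched in base R. This is a minimal illustration, not the documented implementation: count_letters is a hypothetical helper, it handles amino acids only, and the column order used here (alphabetical by one-letter code) is an assumption.

```r
# Hypothetical helper (not the documented function): count single-letter
# amino acid codes in a sequence or a fragment of it, using base R only.
count_letters <- function(seq, start = NULL, stop = NULL) {
  # The 20 standard amino acids, alphabetical by one-letter code (assumed order)
  letters_aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                  "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  # A missing bound defaults to the first or last position
  if (is.null(start)) start <- 1
  if (is.null(stop)) stop <- nchar(seq)
  # Extract the fragment and make the matching case-insensitive
  fragment <- toupper(substr(seq, start, stop))
  chars <- strsplit(fragment, "")[[1]]
  # Spaces are ignored rather than reported
  chars <- chars[chars != " "]
  unknown <- setdiff(chars, letters_aa)
  if (length(unknown) > 0) {
    warning("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  # One count per amino acid, returned as a one-row data frame
  counts <- vapply(letters_aa, function(a) sum(chars == a), integer(1))
  as.data.frame(as.list(counts))
}

# Count from position 2 to the end of a lower-case sequence containing a space
res <- count_letters("mkvla a", start = 2)
print(res[, c("A", "K", "L", "V")])
```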
uniprot.aa returns a data frame of amino acid composition, in the format of the package's table of protein compositions, retrieved from the protein sequence if it is available from UniProt (http://uniprot.org). The protein argument corresponds to the entry name shown on the UniProt search pages.
Value
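read.fasta returns a list of sequences or a list of lines (when ret requests sequences or FASTA format, respectively), or a data frame with amino acid compositions of proteins (when ret requests counts, the default) with columns corresponding to those in the package's table of protein compositions.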
See Also
, like , counts amino acids in a user-input sequence, but returns a data frame in the format of . for an example of counting nucleobases in a DNA sequence.
Aliases
- util.fasta
- read.fasta
- uniprot.aa
- count.aa