Creating a command-line Python app with Click
2015-07-04
This tutorial demonstrates how to add a command-line interface to a script to turn it into a CLI utility program.
As a simple example, let’s write a script to convert a DNA sequence file from one format to another. We’ll actually just call Biopython to do the conversion for us.
First, let’s create a dummy EMBL file, test.embl, to use for testing:
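# Note: Bio.Alphabet was removed in Biopython 1.78. On newer versions, drop the
# IUPAC import and argument and add "molecule_type": "DNA" to the annotations.
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

record = SeqRecord(
    Seq("ACGT", IUPAC.unambiguous_dna),
    id="1",
    name="A",
    description="A sp. genome",
    annotations={"organism": "A sp."},
)
with open("test.embl", "w") as f:
    f.write(record.format("embl"))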
Simple script
# embl2fasta_v1.py
import sys
from Bio import SeqIO
embl_file = sys.argv[1]
fasta_file = sys.argv[2]
SeqIO.convert(embl_file, "embl", fasta_file, "fasta")
Test the script by running
$ python embl2fasta_v1.py test.embl test.fasta
test.fasta should look like this:
>1 A sp. genome
ACGT
Using functions
In this case, the script is super-simple. But usually, it is more useful to package the code up into a function, since you can then factor out anything that needs repeating. It also allows the code to be reused in other scripts.
# embl2fasta_v2.py
from Bio import SeqIO


def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")


if __name__ == "__main__":
    import sys
    embl_file = sys.argv[1]
    fasta_file = sys.argv[2]
    embl2fasta(embl_file, fasta_file)
__name__ is "__main__" only when the script is run from the command line. This means that the last block will not be executed if this script is imported by another script (e.g. from embl2fasta import embl2fasta).
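To see the effect of the guard without creating two separate scripts, here is a small sketch that uses the standard library's runpy module to execute the same source under two different module names. The file name demo_mod.py and the variable ran_main are made up for illustration and are not part of the scripts above.

```python
import pathlib
import runpy
import tempfile

# A tiny stand-in module: the guarded block only flips a flag.
SOURCE = '''
ran_main = False
if __name__ == "__main__":
    ran_main = True
'''

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo_mod.py"  # hypothetical file name
    path.write_text(SOURCE)

    # Executed under an ordinary module name, as an import would do.
    as_import = runpy.run_path(str(path), run_name="demo_mod")
    # Executed under the name "__main__", as "python demo_mod.py" would do.
    as_script = runpy.run_path(str(path), run_name="__main__")

print("imported:", as_import["ran_main"])  # imported: False
print("run as script:", as_script["ran_main"])  # run as script: True
```

runpy.run_path() returns the module's global namespace as a dictionary, which is why the flag can be read back after each run.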
Argument parsing
Let’s add some code to check the input.
Approach 1: Look before you leap
This approach is common in languages like R and (apparently) C.
# embl2fasta_v3.1.py
from Bio import SeqIO


def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        sys.exit("Error: Provide input and output file names.")
    else:
        # Unpack the arguments provided.
        script, embl_file, fasta_file = sys.argv
    embl2fasta(embl_file, fasta_file)
The disadvantage is that the len(), !=, and if operations are executed every time the script is run, whether the input was actually correct or not.
Approach 2: It’s better to beg forgiveness than to ask permission
It is considered more Pythonic to use a try/except
block, since the extra code is only run if an error gets thrown. In
other words, the code is more efficient when the input is assumed
to be correct, and efficiency doesn’t matter when the input is
wrong anyway.
# embl2fasta_v3.2.py
from Bio import SeqIO


def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")


if __name__ == "__main__":
    import sys
    try:
        script, embl_file, fasta_file = sys.argv
    except ValueError:
        sys.exit("Error: Provide input and output file names.")
    embl2fasta(embl_file, fasta_file)
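The two approaches can be compared side by side without touching Biopython. The sketch below is only an illustration: parse_lbyl and parse_eafp are hypothetical helper names, and they return None on bad input instead of calling sys.exit() so that the results can be printed and compared.

```python
def parse_lbyl(argv):
    """Look before you leap: check the length first, then unpack."""
    if len(argv) != 3:
        return None
    _script, embl_file, fasta_file = argv
    return embl_file, fasta_file


def parse_eafp(argv):
    """Beg forgiveness: unpack straight away and handle the failure."""
    try:
        _script, embl_file, fasta_file = argv
    except ValueError:  # raised for too few or too many values
        return None
    return embl_file, fasta_file


good = ["embl2fasta.py", "test.embl", "test.fasta"]
for argv in (good, good[:2], good + ["extra"]):
    # Both styles agree on every input; they differ only in how they get there.
    assert parse_lbyl(argv) == parse_eafp(argv)
    print(len(argv) - 1, "argument(s) ->", parse_eafp(argv))
# 2 argument(s) -> ('test.embl', 'test.fasta')
# 1 argument(s) -> None
# 3 argument(s) -> None
```

Taking the argument list as a parameter, rather than reading sys.argv directly, is a design choice made here so that the parsing logic can be exercised without launching the script.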
Approach 3: Use an argument parser
We could use the argparse module in the standard library, but it would take at least five lines to set up. The Click package is more user-friendly. First install it with pip install click.
# embl2fasta_v3.3.py
from Bio import SeqIO
import click


@click.command()
@click.argument("embl_file")
@click.argument("fasta_file")
def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")


if __name__ == "__main__":
    embl2fasta()
Calling this script with the wrong number of arguments now prints an informative usage message.
Calling it as python embl2fasta_v3.3.py --help will print the following:
Usage: embl2fasta_v3.3.py [OPTIONS] EMBL_FILE FASTA_FILE

  Convert EMBL_FILE to FASTA_FILE.

Options:
  --help  Show this message and exit.
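For comparison, here is roughly what the argparse setup mentioned earlier might look like. This is a sketch rather than a drop-in replacement: the helper name build_parser and the prog value are assumptions, and the call to SeqIO.convert() is left out so that the snippet runs without Biopython.

```python
import argparse


def build_parser():
    """Build a parser with the same two positional arguments as the Click version."""
    parser = argparse.ArgumentParser(
        prog="embl2fasta",  # assumed program name, for the usage message only
        description="Convert EMBL_FILE to FASTA_FILE.",
    )
    parser.add_argument("embl_file", metavar="EMBL_FILE")
    parser.add_argument("fasta_file", metavar="FASTA_FILE")
    return parser


# Passing a list explicitly; with no argument, parse_args() reads sys.argv[1:].
args = build_parser().parse_args(["test.embl", "test.fasta"])
print(args.embl_file, args.fasta_file)  # test.embl test.fasta
```

Like Click, argparse prints a usage message and exits with status 2 when the number of positional arguments is wrong.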
A generic converter
We can now easily add some options to allow conversion between various formats, so that we don’t need to write a separate script for conversion from GenBank format, for example. (The function and parameter names should be updated too.)
# convert_seq_v1.py
from Bio import SeqIO
import click


@click.command()
@click.argument("in_file")
@click.argument("out_file")
@click.option("-f", "--in-format", default="embl", show_default=True)
@click.option("-t", "--out-format", default="fasta", show_default=True)
def convert_seq(in_file, in_format, out_file, out_format):
    """Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT."""
    SeqIO.convert(in_file, in_format, out_file, out_format)


if __name__ == "__main__":
    convert_seq()
The help message now reads as follows:
Usage: convert_seq.py [OPTIONS] IN_FILE OUT_FILE

  Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT.

Options:
  -f, --in-format TEXT   [default: embl]
  -t, --out-format TEXT  [default: fasta]
  --help                 Show this message and exit.
Try running, e.g.
$ python convert_seq.py -t genbank test.embl test.genbank
But what happens if an invalid format is used? We could either specify a list of accepted formats, or handle the errors.
Click can take a list of valid choices for options:
SEQ_FORMATS = ("fasta", "fastq", "embl", "genbank")

@click.option("-f", "--in-format", default="embl", show_default=True,
              type=click.Choice(SEQ_FORMATS))
@click.option("-t", "--out-format", default="fasta", show_default=True,
              type=click.Choice(SEQ_FORMATS))
However, if Bio.SeqIO starts supporting additional formats, this list would have to be updated manually.
Instead, we could allow any input into the script and catch Bio.SeqIO’s own error:
import sys

def convert_seq(in_file, in_format, out_file, out_format):
    try:
        SeqIO.convert(in_file, in_format, out_file, out_format)
    except ValueError as err:
        # Report the library's own message and exit with a non-zero status.
        sys.exit("Error: %s" % err)
Let’s assume we know all the formats we expect to deal with, and go with the first option.
Logging
Finally, let’s have the script produce some status messages to stderr. We’ll use the logging module in the standard library to produce the log and click.style() to colourize the output.
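As a warm-up, here is a minimal sketch of the logging calls on their own, with no Click or Biopython involved. It makes a few assumptions for the sake of a reproducible demo: the logger name convert_seq_demo is made up, the timestamp is left out of the format, and an in-memory stream stands in for stderr so the result can be inspected.

```python
import io
import logging

stream = io.StringIO()  # stands in for stderr in this demo
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("convert_seq_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo records away from the root logger

# The values are passed as separate arguments; logging substitutes them
# into the message only if the record is actually emitted.
logger.info("Converting %s from %s to %s", "test.embl", "embl", "fasta")
logger.debug("Below the INFO threshold, so this line is dropped")

log_text = stream.getvalue()
print(log_text, end="")  # INFO Converting test.embl from embl to fasta
```

Passing the values as arguments rather than formatting the string up front is the same calling convention used with logging.info() and logging.error() in the script.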
Here is the final script:
# convert_seq_v2.py
import logging
import sys

from Bio import SeqIO
import click

SEQ_FORMATS = ("fasta", "fastq", "embl", "genbank")

logging.basicConfig(
    level=logging.INFO,
    datefmt="%Y-%m-%d %X",
    format="%(asctime)s %(levelname)s %(message)s",
)


@click.command()
@click.argument("in_file")
@click.argument("out_file")
@click.option("-f", "--in-format", type=click.Choice(SEQ_FORMATS),
              default="embl", show_default=True)
@click.option("-t", "--out-format", type=click.Choice(SEQ_FORMATS),
              default="fasta", show_default=True)
def convert_seq(in_file, in_format, out_file, out_format):
    """Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT."""
    logging.info(
        click.style("Converting %s from %s to %s", fg="green"),
        in_file, in_format, out_format,
    )
    try:
        SeqIO.convert(in_file, in_format, out_file, out_format)
    except Exception as err:
        logging.error(click.style("%s", fg="red"), err)
        sys.exit(1)  # non-zero status so that callers can detect the failure
    else:
        logging.info(
            click.style("Output written to %s", fg="green"),
            out_file,
        )


if __name__ == "__main__":
    convert_seq()
To trigger the error handling, try converting our dummy EMBL file, which doesn’t contain any quality values, into FASTQ format:
$ python convert_seq_v2.py -t fastq test.embl test.fq
2015-07-04 07:35:27 INFO Converting test.embl from embl to fastq
2015-07-04 07:35:27 ERROR No suitable quality scores found in letter_annotations of SeqRecord (id=1).
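One detail worth checking in scripts like these is the exit status that the shell sees, since other programs rely on it to detect failure. The sketch below is not part of the scripts above: it uses a hypothetical helper, run_snippet, to run three one-line Python snippets in child processes and compare how the different forms of sys.exit() behave.

```python
import subprocess
import sys


def run_snippet(code):
    """Run a one-line Python snippet in a child process and return the result."""
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )


silent = run_snippet("import sys; sys.exit()")
message = run_snippet("import sys; sys.exit('Error: bad input')")
explicit = run_snippet("import sys; sys.exit(1)")

# No argument means success; a string or a non-zero integer means failure.
print(silent.returncode, message.returncode, explicit.returncode)  # 0 1 1
# A string argument is also written to stderr before the process exits.
print(message.stderr.strip())  # Error: bad input
```

In other words, sys.exit() with no argument reports success even when it is called from an error handler, whereas a string message or a non-zero integer reports failure.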