The Variability of Amino Acid Sequences in Hepatitis B Virus

Jianhao Cao; Shuhong Luo; Yuanyan Xiong

doi:10.1007/s12250-018-0070-x

Hepatitis B virus (HBV) is an important human pathogen belonging to the genus Orthohepadnavirus of the family Hepadnaviridae. Over 240 million people are infected with HBV worldwide. Reverse transcription during genome replication results in low-fidelity DNA synthesis, which is the source of variability in the viral proteins. To investigate this variability quantitatively, we retrieved the amino acid sequences of 5,167 records covering all available HBV genotypes (A–J) from the GenBank database. The amino acid sequences encoded by the open reading frames (ORFs) S, C, P and X of the HBV genome were extracted and aligned. We analyzed the variability of protein lengths and sequences as well as the frequencies of amino acids. This analysis comprehensively characterized the variability and conservation of HBV proteins at the amino acid level. In particular, the structural proteins, the hepatitis B surface antigens (HBsAg), contain potential sites critical for virus assembly and immune recognition. Interestingly, the preS1 domain of HBsAg was variable at some amino acid positions, which provides a potential mechanism of immune escape for HBV, while the preS2 and S domains were conserved in length. In the S domain, the cysteine residues and the alpha-helix and beta-sheet secondary structures are likely critical for the stable folding of all HBsAg components. The preC domain and the C-terminal domain of the core protein were also highly conserved. In contrast, the polymerase (HBpol) and HBx were highly variable at the amino acid level. Our research provides a basis for understanding the conserved and important domains of HBV proteins, which could be potential targets for antiviral therapy.

Materials and Methods

HBV Sequence Acquisition and Alignment

In November 2017, a total of 5,167 HBV genome records with confirmed genotypes were retrieved from the GenBank Nucleotide database of the National Center for Biotechnology Information (NCBI). The dataset was divided into categories by geographic area: China; Southeast Asia (SE-Asia, including Indonesia, Malaysia, Myanmar, South Korea, Thailand, Vietnam, Japan and Korea); America (including Argentina, Brazil, Canada, Chile, Colombia, Mexico, USA and Venezuela); Europe (including Belgium, France, Germany, Ireland, Italy, Luxembourg, Netherlands, Poland, Russia, Serbia, Spain, Sweden, Turkey and UK); and other areas (including India, Iran, Saudi Arabia and Syria). The amino acid sequences of the four HBV ORFs (S, C, P and X) were then extracted according to their start and end sites and aligned with Clustal Omega (Sievers et al. 2011).

Subsequent analysis focused on sequences and domains annotated as functional. Sequences labelled "nonfunctional" or "truncated" were discarded. The final dataset comprised 3,208 sequences of ORF C, 3,064 sequences of ORF S, 3,701 sequences of ORF P and 4,265 sequences of ORF X. The domains selected for further analysis were the preS1 receptor binding domain (preS1-RBD), the preS1 domain, the preS2 domain, the S domain, the preC domain, the core protein assembly domain (core-AD, usually 149 amino acids) and the core protein C-terminal domain (core-CTD, also known as the arginine-rich domain, ARD). The position of the preS1-RBD follows the sequence previously reported (Yan et al. 2012). The preS1, preS2 and S domains are encoded by ORF S, while the preC, core-AD and core-CTD are encoded by ORF C. ORFs C and S encode the structural proteins of HBV, which are important antigens, and were selected for further analysis because they may provide a basis for the development of new vaccines.

Features of HBV Sequences and Domains

We analyzed sequence variability from several aspects. All domains were extracted from the aligned sequences, and aligned sequences containing only gaps across a domain were discarded from subsequent analysis. The remaining sequences therefore form the total used as the denominator when calculating the percentages of the indices described below. Among all sequences of an ORF or domain, the most frequent one was defined as the predominant sequence, which represents the conserved sequence. Sequences or domains differing at any amino acid position were defined as unique sequences. Similarly, the most frequent length of a sequence or domain was defined as the predominant length. The proportion of sequences with the predominant length indicates the length conservation of a sequence or domain. Length variability was measured as the number of distinct lengths that a sequence or domain adopted.
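The indices defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `domain_stats`, the 0-based half-open column range, the `-` gap character, the removal of gaps before measuring length, and the reading of "unique sequences" as the count of distinct sequences are all choices made for this example.

```python
from collections import Counter


def domain_stats(aligned_seqs, start, end, gap="-"):
    """Summarise one domain cut from aligned sequences.

    start/end are 0-based, half-open alignment columns (an assumption
    of this sketch; the paper does not state its indexing convention).
    """
    domains = []
    for seq in aligned_seqs:
        column_slice = seq[start:end]
        # Discard sequences that contain only gaps across the domain.
        if column_slice.strip(gap) == "":
            continue
        # Assumption: lengths are measured on the ungapped domain.
        domains.append(column_slice.replace(gap, ""))

    if not domains:
        raise ValueError("no sequence covers this domain")

    total = len(domains)  # denominator for every ratio below
    seq_counts = Counter(domains)
    len_counts = Counter(len(d) for d in domains)
    predominant_seq, seq_n = seq_counts.most_common(1)[0]
    predominant_len, len_n = len_counts.most_common(1)[0]
    return {
        "total": total,
        "predominant_sequence": predominant_seq,
        "predominant_sequence_ratio": seq_n / total,
        # Assumption: "unique sequences" = number of distinct sequences.
        "unique_sequences": len(seq_counts),
        "predominant_length": predominant_len,
        "predominant_length_ratio": len_n / total,
        "number_of_lengths": len(len_counts),
    }


# Toy alignment (made-up sequences, for illustration only).
example = ["MGQNL-ST", "MGQNL-ST", "MGQNLAST", "--------", "MGHNL-ST"]
stats = domain_stats(example, 0, 8)
```

In this toy input the gap-only row is dropped before any ratio is computed, so the denominator is four sequences rather than five.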

Statistics of the Frequency of Amino Acids of Each Site in Sequences

After sequence alignment, we computed the proportion of each amino acid residue (including gaps) at each position as follows.

The proportion was defined as the number of sequences carrying the specified amino acid at a given position divided by the total number of available sequences. We defined the predominant amino acid residue as the one with the highest proportion at that site and evaluated the level of sequence homology to the predominant residue.
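The per-position proportion described above might be computed as in the following sketch. The function names `position_frequencies` and `predominant_residues` are hypothetical, and the sketch assumes that every aligned sequence has the same length and that gaps are written as `-`.

```python
from collections import Counter


def position_frequencies(aligned_seqs):
    """Return, for each alignment column, a dict residue -> proportion.

    Gaps ('-') are counted like any other symbol, as the text specifies.
    All sequences are assumed to share the same aligned length.
    """
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    total = len(aligned_seqs)
    profile = []
    # zip(*seqs) walks the alignment column by column.
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({res: n / total for res, n in counts.items()})
    return profile


def predominant_residues(profile):
    """Pick the residue with the highest proportion at each position."""
    return [max(col.items(), key=lambda kv: kv[1]) for col in profile]


# Toy alignment (made-up sequences, for illustration only).
alignment = ["MGQN", "MGQN", "MGHN", "M-QN"]
profile = position_frequencies(alignment)
consensus = predominant_residues(profile)
```

Each entry of `consensus` pairs the predominant residue with its proportion, which is the quantity the text uses to judge conservation at a site.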

Prediction of the Secondary Structure of HBsAg

A representative sequence record (accession number of its genome record: AM295797) was used to predict the secondary structure of the large, middle and small HBsAg. The prediction was performed on the PSIPRED website (http://bioinf.cs.ucl.ac.uk/psipred/) (Jones 1999).

Analysis of Similarities of Pair-Wise Sequences

Pairwise sequence similarity was calculated from the aligned sequences. Similarity was defined as the ratio of the number of positions with identical amino acid residues to the length of the sequences. The similarity profile was plotted as a histogram, normalized so that the pairwise similarity ratios in each panel sum to 1. The Pearson correlation coefficient (R) between two profiles was computed using the Python SciPy library (https://www.scipy.org). R > 0.8 was regarded as correlated, and R > 0.9 as highly correlated.
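The similarity profile and its comparison can be illustrated with the sketch below. Several details are assumptions, since the text does not specify them: the number of histogram bins, the treatment of a column where both sequences carry a gap (counted as identical here), and all function names. The paper reports using SciPy for the correlation; this sketch computes the Pearson coefficient by hand only to remain dependency-free.

```python
from itertools import combinations
from math import sqrt


def pairwise_similarity(a, b):
    """Fraction of aligned positions where the two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_profile(aligned_seqs, bins=10):
    """Normalised histogram of all pairwise similarities (bars sum to 1)."""
    counts = [0] * bins
    n_pairs = 0
    for a, b in combinations(aligned_seqs, 2):
        s = pairwise_similarity(a, b)
        # A similarity of exactly 1.0 falls into the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two sequences")
    return [c / n_pairs for c in counts]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


# Toy groups (made-up sequences, for illustration only).
group_a = ["MGQN", "MGQN", "MGHN"]
group_b = ["MGQN", "MGQN", "MAHN"]
profile_a = similarity_profile(group_a)
profile_b = similarity_profile(group_b)
r = pearson_r(profile_a, profile_b)
highly_correlated = r > 0.9  # threshold taken from the text
```

Because both profiles are built on the same bins, they can be compared directly; with a different bin count the coefficient would change, so the binning should be fixed before comparing groups.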

Discussion

Although HBV carriers in Africa account for almost 25% of those worldwide (see Global Hepatitis Report, WHO 2017), very few African sequences are available in GenBank, which is a limitation of our study (Table 1). This also reflects the lack of HBV research in Africa. In contrast, most of the HBV sequences come from China, which suggests that China remains a major epidemic area and devotes more attention to HBV research. The amino acid sequence is a direct determinant of protein folding. Furthermore, the low-fidelity replication of the HBV genome by reverse transcription results in a high degree of sequence variation. A comprehensive investigation of the amino acid sequences of HBV proteins is therefore necessary to explore the variability and conservation of the domains that determine their function, as well as the differences among strains from different geographical areas.

Our study revealed the variability of HBV ORFs S, C, P and X at the amino acid level, both in length and in amino acid sequence. This variability does not seem to hamper the normal functions of the viral proteins. The four ORFs of HBV partially overlap, so a mutation in one ORF is also likely to alter another. Such a mutation will affect either the function or the transcription of viral proteins. Hence, the conserved regions in the HBV genome are functionally important, whereas the highly variable regions reveal functional flexibility.

Importantly, the variability of HBsAg at the amino acid level provides many potential epitopes for immune recognition. This is a potential mechanism for exhausting the immune cells or antibodies that recognize HBsAg. It has been reported that HBsAg in tubular subviral particles (SVPs) is arranged regularly in a crystalline-like pattern (Short et al. 2009). As the C-terminal part of HBsAg, the S domain spans the membrane and folds into the protrusions on the periphery (Bruss 2004; Short et al. 2009). Moreover, there were a total of 12 highly conserved cysteine residues in the S domain (Supplementary Figure S3A), far more than in the core-AD of HBcAg (Supplementary Figure S3B). This indicates that these cysteine residues are probably critical for the stability of the protein structure, which could help the protrusions on SVPs to arrange in a regular way. By comparison, the cysteine residues in HBcAg are not even necessary for disulfide bond formation (Yu et al. 2013). The core-CTD is located on the inner side of the HBV capsid, where it helps the virus enclose the genome during assembly (Zlotnick et al. 1997). We postulate that these cysteine residues and the length of the S domain are critical for the stability of the HBsAg structure, especially because the S domain forms the transmembrane region. Furthermore, the N-terminal extension of the preS1-RBD was also highly conserved. It is probably associated with receptor binding in some as yet unknown way.

In conclusion, we studied the viral proteins of HBV at the amino acid level. Quantitative investigation revealed the conservation and variability among different sequences and domains. The sequences critical for virus assembly, the small HBsAg and the core-AD, are the most conserved, together with the preC domain. In contrast, the preS1 domain and HBpol show the highest variability across geographical locations. Further study of the variant epitopes of HBsAg in immune escape and recognition of HBV would be helpful for the development of new vaccines and antiviral drugs.

Acknowledgements

The authors would like to thank Prof. Ping Zhu (Institute of Biophysics, Chinese Academy of Sciences) and Prof. Jingqiang Zhang (Sun Yat-sen University), who provided help in this research. This work was partially supported by the National Natural Science Foundation of China (Nos. U1611265, 81773271 and 31672536) and the Key Projects of Department of Education of Guangdong Province (No. 2017KZDXM088). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author Contributions

JC, SL and YX designed the study. JC conducted the computational work. JC and YX performed the data analysis. JC, SL and YX wrote the manuscript draft.

Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.

Animal and Human Rights Statement

This article does not contain any studies with human or animal subjects performed by any of the authors.
