Disclaimer: This is an example of a student written essay.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UKEssays.com.

Challenges and Countermeasures in Authenticating User in Voice Biometrics

Paper Type: Free Essay Subject: Computer Science
Wordcount: 4876 words Published: 8th Feb 2020


Is That Your Voice?

Abstract— Biometric authentication [1] is a security process that relies on the unique biological traits of an individual to verify that person’s identity. There are many types of biometric authentication, such as fingerprint, facial recognition, retina scan and voice. Among these, the most widely known are fingerprint, voice and facial recognition. A voice biometric system [2] relies on the characteristics and distinctions of a user’s voice: it measures those distinctions and uses them to uniquely identify users. Unlike a password, which can be forgotten, voice authentication allows people to use their own voices as passwords [2]. Methods used in voice authentication are speech synthesis and speech recognition. Voice authentication can be a great way to authenticate a user, but it is prone to many vulnerabilities, of which the spoofing attack is a major one. This paper deals with attacks on voice biometrics and the countermeasures taken to protect the user’s information and identity.


Keywords— ASV (automatic speaker verification), VC (voice conversion), FAR (false acceptance rate), FRR (false rejection rate), GMM (Gaussian mixture model), anti-spoofing, vulnerability, PAD (presentation attack detection), authentication, biometric.



Voice is a strong medium of communication, and so is the voice biometric. It is widely used on mobile devices to access mobile banking and other services, and it is also used in border control. Because of this wide range of applications, the voice biometric is exposed to many vulnerabilities. A vulnerability [3] generally refers to a security flaw caused by a program failure, or a situation in which the security of a system is compromised; security holes are a typical example. One of the major attacks on voice biometrics is the “presentation attack”. Presentation attacks [4] are nowadays commonly known as spoofing. A presentation attack can be defined as capturing human biometrics in such a way that it interferes with the intended purpose of the biometric system. Typically the attacker hides their own identity and poses as another individual, which in short is “impersonating the authentic (target) user”. Impersonation is generally done to gain authorization and reap benefits such as access to the target user’s bank accounts and other financial information, although it may also be done just for fun. In voice biometrics, impersonation means mimicking the intended user’s voice. Another major attack on voice biometrics is the “replay attack”. A replay attack [5], also known as a playback attack, is a type of attack in which a malicious user or entity intercepts a data transmission and either delays or replays it. In voice biometrics, replay attacks are carried out by recording the legitimate user’s voice and using the recording to access the system. The recording can be made with a mobile phone, but it is usually played back through high-quality speakers. In this paper we describe such attacks, their detection, and the countermeasures that can be employed to prevent spoofing and secure the user’s identity. Broadly, this paper covers speaker verification, anti-spoofing, the examination of accuracy and privacy issues, and presentation attack detection.

Figure 1. Replay attack overview and description.


Figure 2. General overall description of our study in this paper

  1. Joint speaker verification and anti-spoofing [6]

Most biometric systems are vulnerable to spoofing attacks, and voice biometrics (automatic speaker verification, ASV) is no exception. Unless a countermeasure is employed, attacks such as replay and voice conversion will continue to cause false biometrics to be accepted for authentication. Voice conversion attacks are considered among the most difficult to detect, even for modern recognition systems. The approach in [6] creates a synthesis channel subspace so that speaker verification and anti-spoofing are performed jointly in the i-vector space, a well-known and tested representation. This avoids fusion techniques and merges speaker verification and anti-spoofing into one system. Although some biometric techniques handle the comparison of noisy and mismatched samples well, they remain vulnerable to malicious spoofing attacks. The weakest links in a biometric system are generally sensor-level and transmission-level attacks. The system proposed in this study is a text-independent speaker verification system of the kind generally used in telephone banking to verify a customer’s identity. A speaker verification system can be spoofed by 4 major types of attack: replay, impersonation, speaker-adapted speech synthesis and voice conversion (VC). The study focuses on VC attacks, because VC is flexible in directing the transformation of speaker characteristics. VC converts an attacker’s utterances so that they sound like the target speaker (the client), which poses a grave threat to any speaker verification system; the problem persists even with modern i-vector recognizers. Notably, the converted voice does not have to be of high quality: even artificial-signal attacks involving fuzzy speech can spoof a recognizer. So even when modern recognizers are used for added protection, their false acceptance rate rises under such attacks.
Few countermeasures exist to safeguard recognizers from these attacks. Two subproblems are involved. First, spoofing attacks must be detected: an efficient speaker verification system measures the strength of the target and non-target hypotheses, whereas an efficient spoofing attack detector must assess the strength of the human and non-human hypotheses, where the latter refers to any imitation of human speech or generation of synthetic speech. Second, speaker verification and the countermeasure must be combined. More generalized countermeasures in anti-spoofing research aim to detect different types of attack, such as replay or VC attacks. This study continues the quest for generalized countermeasures: it uses the same acoustic front end as automatic speaker verification and focuses on general modeling of spoofing attacks rather than on discriminative features that distinguish synthetic from natural utterances. The study assumes that any speech utterance, natural or fabricated, can be represented as a single feature vector φ. A fabricated utterance is treated as a distorted version of natural human speech passed through a particular synthesis channel. Synthesis channels add another source of variation, which arises from spoofing attacks that use different conversion techniques. They introduce speaker-independent properties such as buzziness and discontinuity; because these properties are reflected in cepstral features and in the i-vectors derived from them, detection becomes possible. Given a development pool of i-vectors derived from both natural and fabricated utterances, independent subspace models for the corresponding natural and synthesis channel variation can be trained. For joint modeling of speaker and synthesis channel variation, PLDA (probabilistic linear discriminant analysis) is adopted.

Advantages of PLDA: speaker verification and the countermeasure share the same front end and are integrated in one system; the joint modeling approach has low computational complexity; being a generative model, it generalizes well; and the back-end modeling is not tied to any particular front end or biometric modality.

The experiments employ a spoofing attack dataset based on the core task “1conv4w-1conv4w” of the speaker recognition evaluation. The dataset contains 9440 gender-matched trials for evaluation: 3946 genuine trials, 2747 impostor trials and 2747 impostor trials after VC. Two VC methods are used: joint-density Gaussian mixture model (JD-GMM) and simplified frame selection (FS). JD-GMM converts source features to match the target speaker, while FS uses the target speaker’s features directly, without any modification, to generate the converted speech. To study the generalization ability of the countermeasure, matched and mismatched spoofing conditions are defined.

Matched: The system designer has prior knowledge of the vocoding technique used in the VC attacks.

Mismatched: The system designer has prepared for a specific type of spoofing, but the actual attacks are different.

The ability of the system to resist attacks is evaluated mainly through the FAR (false acceptance rate).

The countermeasure is evaluated independently through the spoofing detection error (SDE), which is computed from the following quantities:

FP = samples erroneously classified as positive.
FN = samples erroneously classified as negative.
P = total positive samples.
N = total negative samples.
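The quantities FP, FN, P and N are enough to compute the usual per-class error rates. The sketch below is only an illustration of those standard rates: the function name and the example counts are hypothetical, and the combined value it returns is a generic overall error, not necessarily the exact SDE definition used in [6].

```python
def error_rates(fp, fn, p, n):
    """Return (false_positive_rate, false_negative_rate, overall_error).

    fp: negatives wrongly classified as positive
    fn: positives wrongly classified as negative
    p, n: total numbers of positive and negative samples
    """
    if p <= 0 or n <= 0:
        raise ValueError("p and n must be positive")
    false_positive_rate = fp / n
    false_negative_rate = fn / p
    # Generic overall error (an assumption, not the paper's SDE formula).
    overall_error = (fp + fn) / (p + n)
    return false_positive_rate, false_negative_rate, overall_error


# Hypothetical counts, for illustration only.
fpr, fnr, overall = error_rates(fp=10, fn=20, p=100, n=200)
print(fpr, fnr, overall)
```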

  1. Voice authentication using short phrases [7]

With the increasing use of mobile phones for transactions protected by voice authentication, this study presents a security model for voice biometrics that expresses overall security in terms of bits of “biometric identity security”, bits of “knowledge security” and bits of “traditional encryption security”. Here the text-dependent voice biometric covers what you know and who you are, and a text-independent component is added to overcome replay attacks. A new method, Vaulted Voice Verification, is introduced, which preserves privacy while addressing the problems of using voice as a remote or mobile biometric identifier. If verifying and authenticating a user by voice takes too long, it inconveniences the user; the study addresses this concern by considering the number of bits that can be produced by a single user interaction. Text-dependent voice template systems are vulnerable to various VC attacks, and text-independent voice models are vulnerable as well. Vaulted Voice Verification seeks to eliminate such vulnerabilities.

This study covers the integration of text-independent speaker models into Vaulted Voice Verification to improve security, an increase in the security gained from every question asked, a security analysis of Vaulted Voice Verification in terms of attack models, and the trade-off between accuracy and security introduced by the text-independent models.

This study extends Vaulted Voice Verification beyond the text-dependent system with single-bit questions and answers. It combines text-dependent modeling, in which models are based on specific word snippets, with text-independent modeling, in which models are built from larger groupings of words. The information gathered from each question is also extended from a binary choice to multiple choice. In combining the text-dependent and text-independent systems, the focus is on small groupings of phrases: short phrases are collected in response to images and are pooled to build the text-independent models. The proposed models are therefore neither completely text-independent nor text-constrained, since they are based on the responses collected from the user; they are short-phrase text-independent models that depend on the number of available phrases. In the previous model, users responded by repeating given phrases, so the phrases were revealed during both registration and verification. This restricted the system to predetermined phrases and weakened the security they provide. In the improved model, users are shown images and asked to give a short description of each. Different users may describe the same image differently, so when text-dependent models are used, the system incorporates both the words and the voice used in the answer, which enhances overall security and yields models with more degrees of freedom. This implicitly combines what you know with who you are, and since the model exists only on the mobile phone, i.e. something the user owns, it paves the way for complete three-factor authentication. The procedure uses text-independent models to mitigate replay attacks, in which the attacker records the user’s audio responses and replays them later during an attempted attack.
To do this, the server first produces a phrase and the user reads it aloud; the server then both builds a text-independent speech model and runs speech recognition to check the content. Because the phrase is not disclosed to the user until the server prompts for it, a replay attack with a prerecorded voice is difficult and impractical. Security therefore relies not only on the general text-independent model but also on the spoken description. The second challenge is to obtain as much information as possible from the user without taking so much time that it annoys them. This was a weakness of the previous Vaulted Voice Verification model, in which the user answered a series of questions and each question provided one bit of security. In the new model each question generates multiple bits, because the binary choice is turned into a multiple choice; the study uses four-choice questions. The model differs from a traditional biometric system in that it does not collect data from the user’s response and compare it with a value stored in a database. Instead, it uses the response to choose between multiple possibilities presented by the server. The analysis assumes that the questions are independent of each other and that an attacker picks among the possible answers at random. If the user is asked a series of 5 questions, the probability of a random attacker getting all of them right under the original Vaulted Voice Verification is 2^-5 (one bit per question). In the new model 2 bits are generated per question, giving 10 bits of security for 5 questions, so the probability of accepting a random guess and disclosing the identity drops to 2^-10. The number of questions can thus be chosen to balance the chance of randomly disclosing an identity against the accuracy of the voice biometric, that is, against the biometric error rate (FAR). The biometric error rate is associated with a biometric dictionary attack, which makes Vaulted Voice Verification with only 5 questions already better. Vaulted Voice Verification provides P bits of security from encryption with the hashed user password, assuming that the device locks after a selected number of attempts; S bits of security from the server’s encryption of the template; K bits of knowledge-based security, referring to information that only the user, and not the attacker, possesses; and B bits of biometric identity security, from voice-based models that take advantage of the differences in the voice and speech patterns of different speakers. The total security is P+K+S+B bits (all bits are required).
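The bit counting in this section can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, independent questions and an attacker who guesses uniformly at random; the function names are hypothetical and are not taken from [7].

```python
import math


def guess_probability(num_questions, choices_per_question):
    """Probability that uniform random guessing answers every question correctly."""
    return (1 / choices_per_question) ** num_questions


def knowledge_bits(num_questions, choices_per_question):
    """Bits of knowledge security: log2(choices) bits per independent question."""
    return num_questions * math.log2(choices_per_question)


def total_security_bits(p_bits, k_bits, s_bits, b_bits):
    """Total security as the sum P + K + S + B described in the text."""
    return p_bits + k_bits + s_bits + b_bits


# Original scheme: 5 binary questions. New scheme: 5 four-choice questions.
print(guess_probability(5, 2), knowledge_bits(5, 2))
print(guess_probability(5, 4), knowledge_bits(5, 4))
```

Moving from binary to four-choice questions doubles the bits per question, so the same five prompts cut the random-guess probability from 2^-5 to 2^-10 without lengthening the interaction.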

  1. Joint operation of voice biometric and presentation attack detection [8]

Automatic speaker verification (ASV) systems are highly vulnerable to spoofing, or presentation attacks, which limits their wide practical deployment. Effective anti-spoofing techniques, formally known as presentation attack detection (PAD) systems, therefore need to be developed and integrated into existing ASV systems, both to reduce complexity and to obtain a single solution. The ease with which an ASV system can be spoofed has motivated research into PAD systems that can accurately and efficiently distinguish genuine speech from presentation attacks. Developing a presentation attack detection method alone is not enough for practical use: such a system must be integrated with ASV. The goal of the joint system is to be resistant to presentation attacks while retaining the verification accuracy of its ASV component. It separates genuine data of valid users from non-valid users, i.e. zero-effort impostors. For the attack detection component of the joint ASV-PAD system, 2 approaches are considered: an LBP-based method, which uses histograms of local binary patterns computed from an audio spectrogram, and a method that uses MFCC features with a GMM-based classifier. For ASV, verification based on ISV modeling and i-vectors is adopted. The PAD, ASV and joint ASV-PAD systems are implemented as open source. There are 2 types of implementation:

  • A more traditional standalone downloadable package, for which the database needs to be provided in order to compute results.
  • Web-based implementations based on the open-source BEAT platform.

According to the usage scenario, the two most vulnerable points for spoofing attacks in an ASV system are physical access and logical access. Most available databases focus on logical access, which is the easier case to study.

This study considers presentation attacks, which are more realistic and pose a serious threat. In such an attack it is assumed that either a stolen set of the user’s samples or automatically generated samples are replayed to the microphone of the attacked ASV system to mimic the genuine user.

One major constraint in developing PAD is the lack of a standard database containing genuine speech samples and a large variety of presentation attacks. The most comprehensive database containing spoofing attacks is AVspoof, which includes several types of replay attack. AVspoof contains real speech samples from 44 participants (31 male, 13 female) recorded over a period of 2 months in 4 sessions, under environmental conditions such as background noise. Speech was recorded using a laptop with an AT2020USB+ microphone, a Samsung Galaxy S4 and an iPhone 3GS. The recordings comprise read speech (10 or 40 predefined sentences), pass phrases (5 short prompts) and free speech (participants speak on any topic for 3-10 minutes). In the AVspoof database the presentation attacks consist of:

Replay of genuine data using the recording devices and high-quality speakers, synthesized speech replayed with a laptop, and converted-voice attacks replayed with a laptop.

The database is split into 3 non-overlapping subsets: training, development (dev) and test. The training subset is used to train the PAD system, and the dev subset is used to determine the hyperparameters of a system, including the equal error rate threshold. Once scores are obtained on the dev set, the false acceptance rate (FAR) and false rejection rate (FRR) are computed, and the equal error rate (EER) threshold is the one at which FAR and FRR are equal. Applying the EER threshold obtained from the dev set to the scores of the test set yields another pair of FAR and FRR values, which measure the system’s performance. For a consistent PAD or ASV system, the FAR and FRR values on the test set should be close to those obtained on the dev set. To summarize the performance of a system in one value, the half total error rate (HTER) is computed as the mean of FAR and FRR. When PAD is sequentially combined with ASV, spoofed data is filtered out first and the non-spoofed data is passed on for verification.
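The dev/test protocol described here can be sketched in a few lines. The code below is a minimal illustration that assumes higher scores mean "more likely genuine" and picks the threshold from the observed dev scores; all function names and the toy scores are hypothetical and do not come from the implementation in [8].

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine, impostor):
    """Candidate threshold (taken from the scores) where FAR and FRR are closest."""
    def gap(threshold):
        far, frr = far_frr(genuine, impostor, threshold)
        return abs(far - frr)
    return min(sorted(set(genuine) | set(impostor)), key=gap)


def hter(genuine, impostor, threshold):
    """Half total error rate: the mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(genuine, impostor, threshold)
    return (far + frr) / 2


# Toy scores, for illustration only.
dev_genuine, dev_impostor = [0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]
test_genuine, test_impostor = [0.95, 0.65, 0.5, 0.85], [0.05, 0.25, 0.7, 0.3]

threshold = eer_threshold(dev_genuine, dev_impostor)   # fixed on the dev set
print(far_frr(test_genuine, test_impostor, threshold)) # reported on the test set
print(hter(test_genuine, test_impostor, threshold))
```

Fixing the threshold on the dev set and reporting errors only on the test set keeps the test figures unbiased, which is why the protocol needs non-overlapping subsets.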

  1. Fraud detection in voice-based identity authentication applications and services [9]

Remembering multiple passwords, PINs, memorable dates and other authentication details is hard, so voice-based authentication is attractive because it removes the need to remember them. Voice biometrics can be used in many areas, such as security, protection, education, and call- and web-based services. These applications are subject to several types of spoofing attack, and such frauds can be detected with the techniques described in this study. As voice is one of the most widely used biometrics, the user’s voice print itself can be used to verify the identity of the claimed user. Fraud detection is essential, and this study uses two methodologies, both based on statistical modeling, to classify genuine and impostor speech. The first methodology classifies genuine and replayed speech using a statistical modeling approach. The second relies on the differences in performance and robustness between different modes of operation for speaker verification. The first proposed method, fraud detection using binary modeling, consists of 2 stages: (i) a genuine verification stage that estimates the likelihood of a speech frame coming from a genuine speaker or from a replay attack, and (ii) a fusion stage that performs a low-level fusion of the results from 2 independent statistical engines, a GMM (Gaussian mixture model) and an HMM (hidden Markov model).

In the HMM-UBM system the feature input is the same as in the GMM-UBM system. A 24-state HMM (including the initial state, with 8 mixtures per emitting state and diagonal covariance matrices) is trained without any speech transcriptions: a dummy word is assigned as the label for all training data. The HMM-UBM is initialized with a flat start, and its parameters are then re-estimated with a few iterations of the Baum-Welch algorithm. Target speaker models are derived from the HMM-UBM with 3 iterations of MAP (maximum a posteriori) adaptation using their respective training data. In the test phase, test utterances are force-aligned against the claimant model and the HMM-UBM to obtain a log-likelihood ratio. The input of the detection stage consists of speech frames, which can come from either a genuine speaker or a replayed speech file. The output of this stage is produced by a likelihood estimator, which gives a likelihood ratio comparing the likelihood that the received speech frame is associated with a genuine speaker with the likelihood that it is a fraud (replayed speech). In the second proposed methodology (fraud detection using fusion of scores from different modes of speaker verification operation), a UBM (universal background model) is trained using all speech files from the TIMIT corpus. The background model is used as a template for generating individual speaker models, which are compact representations of specific speaker identities, by maximum a posteriori (MAP) adaptation using speaker-specific enrolment data. The protocols and trial sets are designed for three operational modes: fixed phrase (mode A), text-dependent (mode B) and text-independent (mode C).
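The likelihood-ratio idea in the detection stage can be illustrated with a small diagonal-covariance GMM scorer. This is a sketch under stated assumptions: the two models are given by hand rather than trained, each is a tuple of (weights, means, variances), and the helper names are hypothetical; it is not the system described in [9].

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one frame under a GMM, using log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def average_llr(frames, genuine_gmm, replay_gmm):
    """Mean per-frame log-likelihood ratio; positive values favour 'genuine'."""
    total = sum(gmm_log_likelihood(f, *genuine_gmm) - gmm_log_likelihood(f, *replay_gmm)
                for f in frames)
    return total / len(frames)


# Hand-made one-component models over 2-dimensional features (illustrative only).
genuine_gmm = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
replay_gmm = ([1.0], [[3.0, 3.0]], [[1.0, 1.0]])
print(average_llr([[0.1, -0.2], [0.0, 0.3]], genuine_gmm, replay_gmm))
```

Comparing the averaged ratio with a threshold then gives the genuine-or-replay decision for the utterance.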

Figure 3. Speaker verification model for fraud detection (second proposed model)


The feature vectors (compact speech signatures obtained by converting speech frames) are used by a machine learning algorithm that fits the vectors to the closest match of a parametric statistical model. The fusion model must be trained with likelihood ratios computed from a variety of replayed and genuine speech frames against a speaker model trained on genuine speech from that speaker; the likelihood ratio indicates how well a frame matches the claimed speaker model. The trained fusion model then maps a set of 3 inputs to a single fused decision score, which is used to determine whether the inputs represent a genuine or a replayed signal.
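One simple way to map three scores to a single fused score is logistic regression. The source does not say which fusion model is used, so the choice of logistic regression, the function names and the toy likelihood ratios below are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(scores, weights, bias):
    """Map the input scores to one fused score in (0, 1); higher means 'genuine'."""
    return sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)


def train_fusion(samples, labels, epochs=500, learning_rate=0.1):
    """Fit the fusion weights with plain stochastic gradient descent on log-loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = fuse(x, weights, bias) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy likelihood ratios from three verification modes (label 1 = genuine, 0 = replay).
samples = [[2.0, 1.5, 1.0], [1.0, 2.0, 1.5], [-1.0, -0.5, -2.0], [-2.0, -1.0, -0.5]]
labels = [1, 1, 0, 0]
weights, bias = train_fusion(samples, labels)
print(fuse([1.5, 1.0, 2.0], weights, bias) > 0.5)
```

A fused score above 0.5 is read as genuine and one below as replayed; in practice the decision threshold would be tuned on held-out data.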


Voice biometric attacks can be detected, and to some extent mitigated, with the techniques surveyed in this paper. Integrating speaker verification and anti-spoofing [6] helps in detecting and countering the vulnerabilities. Where single-bit questions can be outsmarted, multiple-bit questions can be introduced [7], which moves the system toward three-factor authentication, enhancing security and eliminating the need to access a predefined database. Vaulted Voice Verification improves the EER from 6% to 1% and is demonstrated to use both text-independent and text-dependent models. Integrating these models is of great help, and so is the integration of ASV and PAD [8]; the experimental results surveyed show that it increases resistance to spoofing. Finally, different kinds of fraud, vulnerabilities and solutions that enhance the system’s security were discussed [9]. Together these form a foundation for studying future attacks on and defenses of voice biometrics.



[1] M. Rouse, “biometric authentication,” December 2014. [Online]. Available: https://searchsecurity.techtarget.com/definition/biometric-authentication.

[2] M. Rouse, “voice authentication,” December 2016. [Online]. Available: https://whatis.techtarget.com/definition/voice-authentication.

[3] Y. Deguchi, “Vulnerability in Biometric Authentication,” mofiria Corporation, 25 October 2017. [Online]. Available: https://www.mofiria.com/biometrics-and-security-blog/biometrics-en/vulnerability-in-biometrics-en/?lang=en.

[4] “Evaluation of Presentation Attack Detection: An Example,” [Online]. Available: https://www.nist.gov/sites/default/files/documents/2016/12/01/15_tuesday_johnson_evaluation_of_presentation_attack_detection_an_example_ibpc2014_sacs2.pdf.

[5] “Replay Attack,” [Online]. Available: https://www.techopedia.com/definition/21695/replay-attack.

[6] A. Sizov, E. Khoury, T. Kinnunen, Z. Wu and S. Marcel, “Joint Speaker Verification and Antispoofing in the i-Vector Space,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 821-832, April 2015.

[7] R. Johnson, T. E. Boult and W. J. Scheirer, “Voice Authentication Using Short Phrases: Examining Accuracy, Security and Privacy Issues,” 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, USA, 2013.

[8] P. Korshunov and S. Marcel, “Joint Operation of Voice Biometrics and Presentation Attack Detection,” in 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), Niagara Falls, NY, USA, September 2016.

[9] S. Safavi, H. Gan, I. Mporas and R. Sotudeh, “Fraud Detection in Voice-based Identity Authentication Applications and Services,” in IEEE 16th International Conference on Data Mining Workshops, 2016.

