英文文献.docx


English-Language Literature

Improved speech recognition method for intelligent robot

1 Overview of speech recognition

Speech recognition has received more and more attention recently due to its important theoretical meaning and practical value [5]. Up to now, most speech recognition has been based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With deeper study of speech recognition, it has been found that the speech signal is a complex nonlinear process. If the study of speech recognition is to break through, nonlinear-system theory methods must be introduced to it. Recently, with the development of nonlinear-system theories such as artificial neural networks (ANN), chaos and fractal, it has become possible to apply these theories to speech recognition. Therefore, the study in this paper is based on ANN, and chaos and fractal theories are introduced to process speech recognition.

Speech recognition is divided into two kinds: speaker dependent and speaker independent. Speaker dependent refers to a pronunciation model trained by a single person; the identification rate for the training person's orders is high, while others' orders are identified at a low rate or cannot be recognized. Speaker independent refers to a pronunciation model trained by persons of different age, sex and region; it can identify the orders of a group of persons. Generally, the speaker independent system is more widely used, since the user is not required to conduct the training. So extraction of speaker independent features from the speech signal is the fundamental problem of a speaker recognition system.

Speech recognition can be viewed as a pattern recognition task, which includes training and recognition. Generally, the speech signal can be viewed as a time sequence and characterized by the powerful hidden Markov model (HMM). Through feature extraction, the speech signal is transformed into feature vectors that act as observations. In the training procedure, these observations are fed in to estimate the model parameters of the HMM. These parameters include the probability density functions for the observations and their corresponding states, the transition probabilities between the states, etc. After parameter estimation, the trained models can be used for the recognition task. The input observations are recognized as the resulting words, and the accuracy can be evaluated. The whole process is illustrated in Fig. 1.
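As a rough illustration of the recognition step, the sketch below scores an observation sequence against several word models with the forward algorithm and picks the most likely word. It is a simplification under stated assumptions: observations are discrete symbols rather than the continuous feature vectors described above, the models are given rather than trained, and the names `forward_likelihood`, `recognize` and the toy word labels are hypothetical.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """Return P(obs | model) for a discrete-observation HMM (forward algorithm).

    start_p[s]    : probability of starting in state s
    trans_p[r][s] : probability of moving from state r to state s
    emit_p[s][o]  : probability that state s emits symbol o
    """
    n_states = len(start_p)
    # Initialisation with the first observation.
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    # Induction over the remaining observations.
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans_p[r][s] for r in range(n_states)) * emit_p[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Pick the word whose model gives the observations the highest likelihood."""
    return max(models, key=lambda word: forward_likelihood(obs, *models[word]))


# Toy two-state left-to-right models (illustrative values, not from the paper).
TOY_MODELS = {
    "forward": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "backward": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
```

Calling `recognize([0, 0, 1, 1], TOY_MODELS)` selects the model whose states emit symbol 0 first and symbol 1 afterwards. A real system would replace the emission table with probability density functions over feature vectors and estimate all parameters from training data.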

Fig. 1 Block diagram of speech recognition system

2 Theory and method

Extraction of speaker independent features from the speech signal is the fundamental problem of a speaker recognition system. The standard methodology for solving this problem uses Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both of these methods are linear procedures based on the assumption that speaker features have properties caused by the vocal tract resonances. These features form the basic spectral structure of the speech signal. However, the non-linear information in speech signals is not easily extracted by the present feature extraction methodologies. So we use fractal dimension to measure non-linear speech turbulence.

This paper investigates and implements a speaker identification system using both traditional LPCC and non-linear multiscaled fractal dimension feature extraction.

2.1 Linear Predictive Cepstral Coefficients

Linear prediction coefficients (LPC) are a parameter set obtained when we do linear prediction analysis of speech. They describe correlation characteristics between adjacent speech samples. Linear prediction analysis is based on the following basic concept: a speech sample can be estimated approximately by a linear combination of some past speech samples. By minimizing the sum of squared differences between the real speech samples in a given short-time analysis frame and the predicted samples, a unique group of prediction coefficients can be determined.
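One common way to obtain such coefficients is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that approach (the text does not say which solver the authors used) and the predictor convention x[n] ≈ a_1 x[n-1] + … + a_p x[n-p]; the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the least-squares normal equations for the prediction coefficients.

    Returns (a, err): a[k-1] is the coefficient a_k of the predictor
    x[n] ~ sum_k a_k * x[n-k], and err is the residual prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused so indices match a_1..a_p
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        # Reflection coefficient for this order.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        updated = a[:]
        updated[i] = k_i
        for k in range(1, i):
            updated[k] = a[k] - k_i * a[i - k]
        a = updated
        err *= 1.0 - k_i * k_i
    return a[1:], err
```

For a frame `x`, `levinson_durbin(autocorrelation(x, p), p)` yields the p prediction coefficients for that frame.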

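LPC coefficients can be used to estimate the speech signal cepstrum. This is a special processing method in the analysis of the short-time cepstrum of the speech signal. The system function of the channel model is obtained by linear prediction analysis as follows.

[equation not preserved in source]

where p represents the linear prediction order, a_k (k = 1, 2, …, p) represents the prediction coefficients, and the impulse response is represented by h(n). Suppose the cepstrum of h(n) is represented by [symbol not preserved in source], then

[equation not preserved in source] (1) can be expanded as

[equation not preserved in source] (2).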
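For reference, a standard textbook form of the all-pole channel model and of the LPC-to-cepstrum recursion that this passage appears to describe is sketched below. The sign convention in the denominator, the unit gain, and the notation \hat{h}(n) for the cepstrum are assumptions, not the authors' original equations or numbering.

```latex
% All-pole channel model (unit gain and minus sign are assumed conventions)
H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}

% Cepstrum \hat{h}(n) of the impulse response h(n)
\hat{H}(z) = \ln H(z) = \sum_{n=1}^{\infty} \hat{h}(n)\, z^{-n}

% Recursion obtained by differentiating both sides with respect to z^{-1}
\hat{h}(1) = a_1
\hat{h}(n) = a_n + \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) a_k\, \hat{h}(n-k), \quad 1 < n \le p
\hat{h}(n) = \sum_{k=1}^{p} \left(1 - \frac{k}{n}\right) a_k\, \hat{h}(n-k), \quad n > p
```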

The cepstrum coefficients calculated in the way of (5) are called LPCC, where n represents the LPCC order.
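A minimal sketch of converting prediction coefficients to cepstral coefficients follows. It uses the standard recursion for an all-pole model with unit gain and the predictor convention x[n] ≈ Σ a_k x[n-k]; both are assumptions, and `lpc_to_lpcc` is a hypothetical name.

```python
def lpc_to_lpcc(a, n_ceps):
    """Convert prediction coefficients a_1..a_p into n_ceps cepstral coefficients.

    a[k-1] holds a_k. The zeroth (gain) cepstral term is omitted.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused so indices match c_1..c_n
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n does not exceed the LPC order.
        acc = a[n - 1] if n <= p else 0.0
        # Weighted contribution of lower-order cepstral coefficients.
        for k in range(1, min(n - 1, p) + 1):
            acc += (1.0 - k / n) * a[k - 1] * c[n - k]
        c[n] = acc
    return c[1:]
```

With `n_ceps=14`, each frame would yield a 14-dimensional LPCC vector.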

Before extracting the LPCC parameters, we should carry out speech signal pre-emphasis, framing, windowing, endpoint detection, etc. The endpoint detection of the Chinese command word “Forward” is shown in Fig. 2; next, the speech waveform of the Chinese command word “Forward” and the LPCC parameter waveform after endpoint detection are shown in Fig. 3.
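The preprocessing chain can be sketched as follows. The pre-emphasis factor 0.97, the Hamming window, and the simple energy-threshold endpoint detector are common choices assumed here for illustration; the text does not state which parameters or endpoint detection method the authors used, and all function names are hypothetical.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[i] - alpha * signal[i - 1] for i in range(1, len(signal))
    ]


def frame_signal(signal, frame_len, hop):
    """Split the signal into frames of frame_len samples, advancing by hop."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def hamming_window(frame):
    """Apply a Hamming window to one frame to reduce edge discontinuities."""
    n = len(frame)
    if n < 2:
        return list(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]


def detect_endpoints(frames, ratio=0.1):
    """Return (first, last) indices of frames whose energy exceeds
    ratio * max energy, or None if there are no frames or no energy."""
    if not frames:
        return None
    energies = [sum(x * x for x in frame) for frame in frames]
    threshold = ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

A typical call order is `pre_emphasis`, then `frame_signal`, then `hamming_window` on each frame, then `detect_endpoints` to keep only the frames between the detected start and end.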

2.2 Speech Fractal Dimension Computation

Fractal dimension is a quantitative value derived from the scaling relation that gives fractals their meaning, and also a measure of the self-similarity of a structure. The fractal measure is the fractal dimension [6-7]. From the viewpoint of measurement, fractal dimension extends dimension from integers to fractions, breaking the limit of the general topological set dimension being an integer. Fractal dimension, mostly fractional, is a dimension extension in Euclidean geometry.

There are many definitions of fractal dimension, e.g., similarity dimension, Hausdorff dimension, information dimension, correlation dimension, capacity dimension, box-counting dimension, etc., where the Hausdorff dimension is the oldest and also the most important; for any set, it is defined as [3]

[equation not preserved in source]

where M_ε(F) denotes how many units of size ε are needed to cover the subset F. In this paper, the box-counting dimension (D_B) of F is obtained by partitioning the plane with square grids of side ε and counting the number of squares that intersect F (N(ε)), and is defined as [8]

[equation not preserved in source]
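A rough sketch of estimating a box-counting dimension for a sampled waveform is given below. It relies on the usual definition D_B = lim (ε→0) ln N(ε) / ln(1/ε) and approximates the limit by the least-squares slope of ln N(ε) against ln(1/ε) over a few grid sizes. Scaling time and amplitude to the unit square, the column-wise counting rule, and the choice of scales are assumptions for illustration, not the authors' multiscale procedure.

```python
import math


def box_count(signal, samples_per_box):
    """Count grid boxes of side eps touched by the waveform.

    Time and amplitude are both scaled to [0, 1]; eps equals
    samples_per_box / (len(signal) - 1). Returns (count, eps).
    """
    m = len(signal) - 1
    eps = samples_per_box / m
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    ys = [(v - lo) / span for v in signal]
    total = 0
    for start in range(0, m, samples_per_box):
        # Adjacent columns share their boundary sample so the curve is continuous.
        column = ys[start:start + samples_per_box + 1]
        total += int(max(column) / eps) - int(min(column) / eps) + 1
    return total, eps


def box_counting_dimension(signal, scales=(1, 2, 4, 8, 16)):
    """Estimate D_B as the slope of ln N(eps) versus ln(1/eps)."""
    xs, ys = [], []
    for s in scales:
        n_boxes, eps = box_count(signal, s)
        xs.append(math.log(1.0 / eps))
        ys.append(math.log(n_boxes))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A smooth ramp gives an estimate close to 1, while a waveform that oscillates at every sample fills the plane and gives an estimate close to 2. Applying the estimator frame by frame would produce one fractal dimension value per frame.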

The speech waveform of the Chinese command word “Forward” and the fractal dimension waveform after endpoint detection are shown in Fig. 4.

2.3 Improved feature extraction method

Considering the respective advantages of LPCC and fractal dimension in expressing the speech signal, we mix both to form the feature signal; that is, fractal dimension denotes the self-similarity, periodicity and randomness of the speech time waveform, while the LPCC feature is good for speech quality and yields a high identification rate.

Owing to the obvious advantages of ANN, such as nonlinearity, self-adaptability, robustness and self-learning, its good classification and input-output mapping ability are suitable for solving the speech recognition problem.

Because the number of ANN input nodes is fixed, time regularization is applied to the feature parameters before they are input to the neural network [9]. In our experiments, the LPCC and fractal dimension of each sample need to pass through the time regularization network separately: LPCC is regularized to 4-frame data (LPCC1, LPCC2, LPCC3, LPCC4, each frame parameter is 14-D), and fractal dimension is regularized to 12-frame data (FD1, FD2, …, FD12, each frame parameter is 1-D), so that the feature vector of each sample has 4*14+1*12=68 dimensions, ordered so that the first 56 dimensions are LPCC and the remaining 12 dimensions are fractal dimensions. Thus, such a mixed feature parameter can show both the linear and the nonlinear characteristics of speech.
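The assembly of the 68-dimensional vector can be sketched as below. The paper performs time regularization with a network [9]; this sketch swaps in plain segment averaging as a stand-in, which is an assumption made only to keep the example short. The function names are hypothetical.

```python
def regularize_frames(frames, target):
    """Reduce a variable-length list of frame vectors to `target` frames by
    averaging consecutive segments (a stand-in for the paper's time
    regularization network). Requires at least one input frame."""
    n = len(frames)
    dim = len(frames[0])
    out = []
    for j in range(target):
        start = j * n // target
        end = max((j + 1) * n // target, start + 1)  # never an empty segment
        segment = frames[start:end]
        out.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return out


def build_feature_vector(lpcc_frames, fd_values):
    """Concatenate 4 regularized 14-D LPCC frames (56 values) with
    12 regularized fractal dimension values into one 68-D vector."""
    lpcc4 = regularize_frames(lpcc_frames, 4)
    fd12 = regularize_frames([[v] for v in fd_values], 12)
    vector = [x for frame in lpcc4 for x in frame]  # first 56 dimensions: LPCC
    vector += [frame[0] for frame in fd12]          # last 12: fractal dimension
    return vector
```

The output length depends only on the target frame counts and the LPCC order, so utterances of different durations all map to the same number of ANN input nodes.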

Architectures and Features of ASR

ASR is a cutting-edge technology that allows a computer or even a hand-held PDA (Myers, 2000) to identify words that are read aloud or spoken into any sound-recording device. The ultimate purpose of ASR technology is to allow 100% accuracy with all words that are intelligibly spoken by any person regardless of vocabulary size, background noise, or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech (e.g., the sentence) remains less than 90%. Dragon's Naturally Speaking or IBM's ViaVoice, for example, show a baseline recognition accuracy of only 60% to 80%, depending upon accent, background noise, type of utterance, etc. (Ehsani & Knodt, 1998). More expensive systems that are reported to outperform these two are Subarashii (Bernstein, et al., 1999), EduSpeak (Franco, et al., 2001), Phonepass (Hinks, 2001), ISLE Project (Menzel, et al., 2001) and RAD (CSLU, 2003). ASR accuracy is expected to improve.

Among several types of speech recognizers used in ASR products, both implemented and proposed, the Hidden Markov Model (HMM) is one of the most dominant algorithms and has proven to be an effective method of dealing with large units of speech (Ehsani & Knodt, 1998). Detailed descriptions of how the HMM model works go beyond the scope of this paper and can be found in any text concerned with language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003). Put simply, HMM computes the probable match between the input it receives and phonemes contained in a database of hundreds of native speaker recordings (Hinks, 2003, p. 5). That is, a speech recognizer based on HMM computes how close the phonemes of a spoken input are to a corresponding model, based on probability theory. High likelihood represents good pronunciation; low likelihood represents poor pronunciation (Larocca, et al., 1991).

While ASR has been commonly used for such purposes as business dictation and special needs accessibility, its market presence for language learning has increased dramatically in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003). Early ASR-based software programs adopted template-based recognition systems which perform pattern matching using dynamic programming or other time normalization techniques (Dalby & Kewley-Port, 1999). These programs include Talk to Me (Auralog, 1995), the Tell Me More Series (Auralog, 2000), Triple-Play Plus (Mackey & Choi, 1998), New Dynamic English (DynEd, 1997), English Discoveries (Edusoft, 1998), and