# Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics

- Real-World Data and Real-World Evidence: Big Data in Practice
- Use of EMR/EHR Database for Research and Scientific Discoveries: Procedure and Life Cycle
- Initiate a Project
- Data Queries and Data Extraction
- Data Cleaning
- Data Pre-Processing or Processing
- Data Preparation
- Data Analysis, Modeling and Prediction
- Result Validation
- Result Interpretation
- Publication and Dissemination
- Challenges and Opportunities
- References

## EHR Project Management

- Introduction
- Project and Sub-Project in EHR Research
- Data, Code and Product Management
- Team/People Management
- How to Form a Team: What Expertise is Needed for EHR Projects?
- How to Efficiently Manage a Multidisciplinary Team?
- Task Management
- Management Methods and Software Tools
- An Example of a Data Management Framework
- Folder Management
- File Management
- User Management
- Data Management Framework
- Discussion and Summary
- Appendix: File Submission Form
- Note
- References

## EHR Databases and Data Management: Data Query and Extraction

- Introduction
- EHR/EMR Database Availability and Access
- EHR/EMR Database Design and Structure: Database Queries
- Data Extraction
- Define Inclusion/Exclusion Criteria for Data Extraction
- Phenotyping: Cohort Identification
- Data Extraction Report
- Illustration Example: Subarachnoid Hemorrhage (SAH) Project
- EHR Database Design and Construction
- SAH Cohort Identification and Data Extraction
- Data Extraction Report
- Potential Data Extraction Pitfalls and Errors with Solutions
- References

## EHR Data Cleaning

- Introduction
- Review of Current Data Cleaning Methods and Tools
- Data Wranglers
- Data Cleaning Tools for Specific EHR Datasets
- Data Quality Assessment
- Common EHR Data Errors and Fixing Methods
- List of Common Errors in an EHR Database
- Demographics Table
- Lab Table
- Clinical Event Table
- Diagnosis and Medication Table
- Procedure Table
- Discussion
- Acknowledgments
- Notes
- References

## EHR Data Pre-Processing and Preparation

- Introduction
- Data Pre-Processing
- Tidy Data Principles
- Feature Extraction: Derived Variables
- Dimension Reduction
- Missing Data Imputation
- Data Preparation
- Define the Endpoint or Outcome
- Process Medical Record Timestamps
- Define the Encounter Time Interval
- Encounter Combination
- Define Comparison Groups
- Cohort Refining
- Leakage Detection
- Data Preparation for Different Analysis Purposes
- Data Processing/Preparation Errors and Pitfalls with Solutions
- Data Pre-Processing and Preparation Report
- Summary
- References

## Missing Data Issues in EHR

- Introduction and Overview
- Missing Data Mechanisms
- Methods for Incomplete EHR Data
- Naïve Method
- Imputation Using Statistical Models
- Machine Learning and Deep Learning Models
- Choice of Best Method for EHR Data
- Case Study
- Missing Condition in EHR Data
- Missing Imputation in EHR Datasets
- Evaluating the Performance of Imputation Methods and Thresholds
- Discussion and Conclusion
- References

## Causal Inference and Analysis for EHR Data

- Introduction
- Why Causal Inference
- Overview of Causal Inference Methods: Rubin Causal Model (RCM)
- Basic Framework in Causality: Potential Outcome Framework
- Propensity Scoring
- Brief Introduction
- Propensity Scoring for Binary Treatments
- Propensity Scoring for Multiple Treatments
- Propensity Scoring for Ordinal Treatments
- Propensity Score Estimation for Complex Data Sets
- Illustration Example: Subarachnoid Hemorrhage (SAH) Project
- Mediation Analysis
- Introduction to Mediation Analysis
- The Product Method
- The Difference Method
- Other Considerations
- Instrumental Variables Networks for Treatment Effect Estimation in the Presence of Unmeasured Confounders
- Instrumental Variables Frameworks
- Two-Stage Least Square Methods with Linear Models
- Learning Treatment Effect by Generative Adversarial Networks
- Introduction
- CGANs as a General Framework for Estimation of Individualized Treatment Effects
- Wasserstein GANs for Estimation of Individualized Treatment Effects
- MisCGANs for Estimation of Individualized Treatment Effects
- Optimal Treatment Selection
- Deconfounder in Estimation of Treatment Effects
- Introduction
- Causal Models with Latent Confounders
- Adversarial Learning Confounders
- Loss Function and Optimization for Estimating ITEs in the Presence of Confounders
- Targeted Maximum Likelihood Estimation
- Supplementary Note A
- Wasserstein GAN A1 Different Distances
- References

## EHR Data Exploration, Analysis and Predictions: Statistical Models and Methods

- Introduction
- Statistical Challenges for EHR Data
- Overview of Existing Methods
- Data Exploration and Visualization
- Statistical Models for EHR Data
- Contingency Tables
- Chi-Square Test
- Hypergeometric Test
- GLM
- Survival Model
- Mixed-Effect Models
- Time Series Analysis AR, MA and ARMA Model
- Gaussian Process
- Variable Selection Methods
- Stepwise Variable Selection
- Purposeful Variable Selection
- SIS
- Penalty-Based Methods
- Divide-and-Conquer Method
- Validation
- Results and Examples
- Discussions and Conclusions
- References

## Neural Network and Deep Learning Methods for EHR Data

- Introduction
- Deep Learning Methods for EHR Data
- Deep Learning Software Tools and Implementation
- Application Examples
- Discussion
- References

## EHR Data Analytics and Predictions: Machine Learning Methods

- Machine Learning Overview
- Machine Learning Methods
- Machine Learning Software Tools
- Application Example: SAH Project
- Conclusion and Recommendation
- References

## Use of EHR Data for Research: Future

- Future EHR Research
- Post-Research Practice
- Summary
- References