LINEAR REGRESSION WITH DATA MISSING NOT AT RANDOM: BOOTSTRAP APPROACH


Rakhimov Z., & Rakhimova N. (2024). Linear regression with data missing not at random: bootstrap approach. Iqtisodiy Taraqqiyot Va Tahlil, 2(4), 492-502. Retrieved from https://www.inlibrary.uz/index.php/eitt/article/view/48520



Iqtisodiy taraqqiyot va tahlil, 2024-yil, aprel

www.e-itt.uz




PhD Rakhimov Zarrukh Aminovich
Westminster International University in Tashkent
ORCID: 0009-0001-0583-4819
zrakhimov@wiut.uz

Rahimova Nilufar Aminovna
Silk Road International University of Tourism and Cultural Heritage
ORCID: 0000-0002-8648-2543
nrahimova@wiut.uz



UDC: 330.43
ISSUE IV - APRIL, 2024
492-502



LINEAR REGRESSION WITH DATA MISSING NOT AT RANDOM:

BOOTSTRAP APPROACH

PhD

Rakhimov Zarrukh Aminovich

Westminster International University in Tashkent

Rahimova Nilufar Aminovna

Silk Road International University of Tourism and Cultural Heritage

Abstract. OLS regressions have a set of assumptions in order for their point and interval estimates to be unbiased and efficient. Data missing not at random (MNAR) can pose serious estimation issues in linear regression. In this study we evaluate the performance of OLS confidence interval estimates with MNAR data. We also suggest bootstrapping as a remedy for such data cases and compare the traditional confidence intervals against bootstrap ones. As we need to know the true parameters, we carry out a simulation study. Research results indicate that both approaches show similar results, with similar interval sizes. Given that the bootstrap requires a lot of computation, traditional methods are still recommended even in the case of MNAR.

Key words: linear model, sample size, confidence interval, bootstrap, accuracy, interval size, missing not at random

Introduction.
Since its introduction, OLS regression has become one of the most widely used modelling techniques for showing the impact of one or more variables on a dependent variable. This linear modelling approach is used primarily for two goals. Firstly, OLS regressions can explain the relationship between two or more variables. Secondly, one can use OLS for simple forms of forecasting. Though they never perfectly imitate the real world, linear models are very widely used given their simplicity to build and ease of interpretability. Linear regressions almost always provide an approximation of real-life relationships. In order for an OLS regression to give reliable estimates, we must meet a set of OLS assumptions. These requirements are:

1. Equal variance of the error term
2. No strong multicollinearity between explanatory variables
3. No severe outliers
4. Sample size larger than 30 observations
5. Linearity in the relationship
6. Normality of residuals
7. Stationarity or no autocorrelation of residuals (in case of time series data)
8. No important data missing in our dataset

In case any of these assumptions is violated, OLS confidence intervals might give misleading outcomes and inferences. Interested researchers can refer to Gujarati (2004) for a more in-depth discussion of these assumptions and of the outcomes when they are violated. In this study, however, we will concentrate on the case when important data points are missing not at random. This case appears relatively often in cross-sectional data when data collection in certain segments of society is quite difficult or impossible. The results of this study will be of great benefit for cross-sectional analysis, which is applied not only in economic studies but also in many other social sciences. As we need to know the true coefficient in order to evaluate the estimated intervals, we will carry out a simulation study and compare both methods. In later chapters, we are going to look at how OLS confidence intervals may behave when data is missing not at random (hereafter referred to as MNAR) and whether bootstrapping can serve as a remedy for such cases.

The paper is structured in the following way. First, we discuss existing studies on this topic and look at their findings. Afterwards, we look at the theoretical side of traditional confidence interval estimation, and at bootstrapping the data and building bootstrap intervals.



Next, we have a look at the simulation approach carried out in R. Lastly, we look into the results of the simulation and draw our conclusions.

Literature review.
Bootstrapping is a simple yet powerful resampling tool for estimating the properties of a certain statistic or parameter. The idea of bootstrapping lies in repeatedly resampling the sample data. This approach was pioneered by Efron (1979) and, since then, bootstrap resampling has been widely used in many social sciences.

Bootstrap resampling can also be used in the context of linear models. In the literature, two types of bootstrapping are used in linear models: bootstrapping residuals and bootstrapping pairs (Chernick and LaBudde, 2011).

Bootstrapping pairs: bootstrapping pairs is a rather simple but powerful approach first proposed by Freedman (1981). Under this approach, we resample independent and dependent variables together from the original sample, which results in a bootstrap sample. We then use the usual OLS method to estimate β from the bootstrap sample. This procedure is repeated B times in order to get a distribution of coefficient estimates β̂_j for j = 1, 2, …, B. This distribution in turn can give the bootstrap standard deviation.
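The pairs procedure can be sketched in a few lines. The paper's own simulations use R; the following is an illustrative Python sketch on made-up data (the model y = 4 + 5x + noise, the sample size and B are assumptions for the example, not values from the paper):

```python
import random
import statistics

random.seed(0)

def ols_slope(x, y):
    """OLS slope: sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

# Made-up original sample from y = 4 + 5x + noise
n = 100
x = [random.gauss(5, 4) for _ in range(n)]
y = [4 + 5 * xi + random.gauss(0, 7) for xi in x]

B = 1000
boot_betas = []
for _ in range(B):
    # Resample (x, y) pairs with replacement -> one bootstrap sample
    idx = [random.randrange(n) for _ in range(n)]
    boot_betas.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))

# The spread of the B re-estimated slopes is the bootstrap standard deviation
boot_sd = statistics.stdev(boot_betas)
```

Because the (x, y) pairs are resampled together, no assumption is made about the error structure of the model.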

When comparing the two approaches, a paper by Efron and Tibshirani (1986) comes to the conclusion that both approaches are equivalent when all assumptions of OLS are met, but each approach can perform differently when the number of observations is small. Comparing bootstrapping residuals and bootstrapping pairs when the model is correctly specified and when heteroscedasticity is present in the linear model, Flachaire (2003) concludes that when a proper transformation of the residual term is applied (wild bootstrap), the residual bootstrap performs better than bootstrapping pairs. Another paper by Chernick and LaBudde (2011) finds, however, that bootstrapping vectors (pairs) is less sensitive to violations of model assumptions and can still perform well if those assumptions are not met. This can be explained by the fact that the vector method does not depend on the model structure while bootstrapping residuals does.

Bootstrapping residuals: As noted earlier, this is a resampling technique first introduced by Efron in 1982. Let us consider the following model:

Y_i = g_i(β) + e_i,  for i = 1, 2, …, n

where g_i(β) is a function with a known form. To estimate β, we minimize the distance between our true dependent variable Y_i and the estimated function g_i(β). These distances are expressed in terms of residuals ê_i = Y_i − g_i(β̂). The idea behind the residual bootstrap is to take the distribution of residuals, each having probability 1/n for i = 1, 2, …, n, and sample n times from this distribution to get a bootstrap sample of residuals, which can be denoted as (e*_1, e*_2, e*_3, …, e*_n). Afterwards, a bootstrap dependent variable can be generated using Y*_i = g_i(β̂) + e*_i. Now, as we have our bootstrap dataset, we use the simple OLS method to estimate β. We repeat the above procedure B times to get a distribution of estimates β̂_j for j = 1, 2, …, B. One can then take the standard deviation of β̂ to build bootstrap confidence intervals.
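A minimal sketch of the residual bootstrap for the simple linear case g_i(β) = β_0 + β_1·x_i (Python rather than the paper's R; the data-generating values are assumptions for the example):

```python
import random
import statistics

random.seed(1)

def ols_fit(x, y):
    """Return (intercept, slope) by ordinary least squares."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

# Made-up original sample
n = 100
x = [random.gauss(5, 4) for _ in range(n)]
y = [4 + 5 * xi + random.gauss(0, 7) for xi in x]

# Fit once on the original data: fitted values g_i(beta_hat) and residuals
b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

B = 1000
boot_betas = []
for _ in range(B):
    # Draw n residuals with replacement (each with probability 1/n) ...
    e_star = [random.choice(resid) for _ in range(n)]
    # ... and build the bootstrap dependent variable Y*_i = g_i(beta_hat) + e*_i
    y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
    boot_betas.append(ols_fit(x, y_star)[1])

boot_sd = statistics.stdev(boot_betas)  # basis for bootstrap intervals
```

Note that, unlike the pairs method, the x values stay fixed here; only the residuals are reshuffled around the fitted line, so this variant leans on the model being correctly specified.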

Other methods are also considered in the further literature, such as the percentile-t bootstrap (DiCiccio and Efron, 1992) and the stationary bootstrap (Politis and Romano, 1994), each used under different scenarios of non-constant variance of the residuals.

This study aims to shed further light on the method of bootstrapping pairs in the context of OLS models with data missing not at random.

Linear regression models.
Now, we will look into the method of building linear models in more detail. As mentioned in earlier chapters, linear regressions try to reveal the relationship between one y variable (often referred to as the dependent variable) and one or more x variables (often referred to as explanatory or independent variables). The principle of the linear model lies in mathematically calculating the beta coefficients of those x variables. For example, somebody wants to evaluate whether having a university degree influences one's income and, if yes, by how much. Linear regression is intended



to answer exactly these questions using the so-called ordinary least squares (OLS) method, where income is the dependent "Y" variable and years of education is the "X_1" explanatory variable. Then the coefficient of "years of education" (β_1) shows the size and direction (positive or negative) of the influence.

Y = β_0 + β_1 · X_1 + e

where Y is the dependent variable, β_0 the intercept (which need not have a meaning), β_1 the coefficient of the first explanatory variable, X_1 the explanatory or independent variable, and e the error or residual term.

The given formula is a clear example of a linear relationship between the X and Y variables. Although the relationship between two variables is almost never exactly linear in real life, linear approximation has proven to work well in many domains. In practice, researchers take more X variables that have been theoretically shown to affect the selected dependent variable Y. In order to evaluate the correctness and accuracy of the model, a set of statistics such as R squared, adjusted R squared, AIC or BIC is used in practice. This part is out of the scope of our research, although interested readers can refer to Greene (2004).

Estimation of the coefficients in the above model is done with the method of least squares, commonly known as OLS (ordinary least squares). The least squares estimate of β_1 is given by:

β̂_1 = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^{n} (X_i − X̄)²

where n is the number of observations, X_i the value of the independent variable for the i-th observation, Y_i the value of the dependent variable for the i-th observation, X̄ the mean of the independent variable X, and Ȳ the mean of the dependent variable Y.
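The formula translates directly into code. A small Python illustration with made-up numbers (the dataset is invented purely to show the arithmetic):

```python
# Tiny made-up dataset (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# beta1_hat = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
            / sum((xi - x_bar) ** 2 for xi in x)

# The fitted line passes through the point of means, so the intercept is:
beta0_hat = y_bar - beta1_hat * x_bar
```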

Traditional confidence intervals.
We are very often interested not only in coefficient estimates, but also in the interval of possible values of the coefficient at a certain level of confidence. In the literature, the latter is known as a confidence interval. Researchers are interested in interval estimates because point estimates of coefficients are always an approximation to the true population value. Interval estimations, commonly known as confidence intervals, have a set of advantages. Firstly, they give a range of values where the true population value can be located. Secondly, confidence intervals will indicate whether the true population parameter might be equal to 0, in other words, whether the effect of that specific explanatory/independent variable on the dependent variable is insignificant. Currently, all statistical software provides both point and interval estimates by default. Below, we will look at the theoretical side of building confidence intervals for the coefficients of linear models.

Confidence interval construction takes its origin from the core theorem in statistics, the Central Limit Theorem (referred to as the CLT). The CLT indicates that if one derives many sample averages from many samples generated from the same population, then the distribution of the sample averages is approximately normal (also referred to as Gaussian) (Lind et al., 1967). The midpoint of the resulting distribution of sample averages will be equal to the true population mean (see Figure 1). This is a very strong finding that can also be applied in confidence interval construction.



Figure 1

In reality, we can almost never take many samples from the same population due to the size of the population (imagine taking 1000 samples of 10,000 observations each) and we are very often left to work with only one sample. Nevertheless, one can still make some estimation regarding the population value (e.g. mean, coefficient) using the central limit theorem, even when the distribution of the population dataset is not known.

Confidence interval based on the CLT: Consider that we have only one sample from the population. Firstly, we can estimate the sample coefficient using the method of ordinary least squares (discussed in the previous chapter). Afterwards, we can estimate the standard error of the estimated coefficient using the following formula, which also arises from the method of least squares:

se(β̂_1) = s / √( Σ_{i=1}^{n} (X_i − X̄)² )

where s is the standard deviation of the residuals (residual standard error), n the number of observations, X_i the value of the independent variable for the i-th observation, and X̄ the mean of the independent variable X.


As the distribution of the coefficient β̂_1 is approximately normal based on the central limit theorem, we employ properties of the standard normal distribution (z-distribution) and build 90%, 95% or 99% confidence intervals:

β̂_1 ± z_{α/2} · se(β̂_1)

where β̂_1 is the sample coefficient estimate, z_{α/2} is the value from the standard normal distribution that gives a tail area of α/2, and se(β̂_1) is the standard error of the coefficient.

The above confidence interval can be understood in the following way: a 97% interval indicates that if we construct 100 confidence intervals from 100 random samples generated from the true population, then 97 of those confidence intervals will contain the true population coefficient β_1. Also, employing this confidence interval, one can verify whether the population coefficient is insignificant: if the estimated confidence interval contains zero, then one can suspect that the true population parameter can be equal to zero (Gujarati, 2004).
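Putting the pieces together, the traditional interval can be sketched in Python (the simulated sample and its generating values are assumptions for the example, so the resulting numbers are illustrative only):

```python
import math
import random

random.seed(2)

# Made-up sample: y = 4 + 5x + noise (values assumed for the example)
n = 50
x = [random.gauss(5, 4) for _ in range(n)]
y = [4 + 5 * xi + random.gauss(0, 7) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error s (n - 2 degrees of freedom: two fitted coefficients)
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (n - 2))

se_b1 = s / math.sqrt(sxx)   # se(beta1_hat) = s / sqrt(sum((Xi - Xbar)^2))
z = 1.96                     # z_{alpha/2} for a 95% interval
ci = (b1 - z * se_b1, b1 + z * se_b1)
```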

Yet, the estimation of intervals and coefficients depends on the completeness of the data, which is one of the assumptions of the linear model. Interval estimates may give inaccurate or even biased results if a certain portion of very important data is missing. In this study we look at this case, also known as Data Missing Not at Random.



In the next section, we suggest another way, bootstrapping, of constructing confidence intervals for the coefficients.

BOOTSTRAP CONFIDENCE INTERVAL ESTIMATION

Bootstrap confidence intervals offer an alternative, rather simple way of building intervals. The bootstrap implies selecting one sample, generating many other different samples from this single original sample, and estimating the parameter of interest in each newly created sample. Under the bootstrap approach, the original sample is treated as a population and we generate many other samples (known as bootstrap samples) out of it. When a large number of bootstrap samples has been created, we estimate the sample parameter (e.g. a coefficient) from every bootstrap sample. Consequently, we will have a distribution of bootstrap sample estimates.

This distribution of bootstrap sample estimates can be used to construct our confidence intervals. For example, if we want to construct a 95 percent interval, we take the 2.5th and 97.5th percentiles of the bootstrap distribution. Figure 2 explains the method of bootstrapping visually.

Figure 2
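The percentile interval just described can be sketched in Python using pairs resampling (the sample, its generating model and B = 1000 are assumptions for the example):

```python
import random

random.seed(3)

def ols_slope(x, y):
    """OLS slope estimate for a simple linear regression."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
           sum((a - xbar) ** 2 for a in x)

# Made-up original sample
n = 80
x = [random.gauss(5, 4) for _ in range(n)]
y = [4 + 5 * xi + random.gauss(0, 7) for xi in x]

B = 1000
boot = []
for _ in range(B):
    # Resample (x, y) pairs with replacement and re-estimate the slope
    idx = [random.randrange(n) for _ in range(n)]
    boot.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
boot.sort()

# 95% percentile interval: 2.5th and 97.5th percentiles of the bootstrap slopes
lower = boot[int(0.025 * B)]        # 25th ordered value out of 1000
upper = boot[int(0.975 * B) - 1]    # 975th ordered value out of 1000
```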

SIMULATION

In this section, we discuss the simulation of the linear regression and introduce the case of data missing not at random. We do not use real-life data; rather, we simulate, for two reasons. In the first place, the true population coefficient β_1 should be known to us, and in real life we almost never know the true parameters. In the second place, we need to control the form of the data missing not at random, i.e. what share of the data is missing and from which variable. We rely on existing papers to imitate a similar form of data missing not at random. Our simulation starts with the simplest form of linear model with one explanatory variable, as given below:

Y = β_0 + β_1 · X_1 + ε

where

X_1 ~ N(5, 4)
ε ~ N(0, 50)



where the intercept (β_0) and β_1 are defined by us. The independent variable (X_1) comes from a normal distribution with a mean of 5 and a standard deviation of 4. The error term has a mean of 0 and a variance of 50.

In order to simulate data missing not at random, we follow the approach of Schafer et al. (2002), where a certain part of the upper percentiles of the X variable is removed. In our case we take the data above the 80th percentile of X and remove 90 per cent of that data. Those values will be labelled as NA or Null (in R Studio, both are treated equally). Afterwards, we construct confidence intervals using both approaches, traditional and bootstrap. In order to evaluate the performance at different sample sizes, we first start with a sample size of 30 and then increase it by 10 observations up to 200 observations. All of the simulations are carried out in the R software.
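The missingness mechanism just described can be sketched as follows (Python rather than the paper's R; NA is represented by None, and the sample itself is made up for the example):

```python
import random

random.seed(4)

n = 200
x = [random.gauss(5, 4) for _ in range(n)]

# 80th percentile of x (simple order-statistic definition)
cutoff = sorted(x)[int(0.8 * n)]

# Values above the cutoff go missing with probability 0.9
# (MNAR: missingness depends on the value itself)
x_mnar = [None if (xi > cutoff and random.random() < 0.9) else xi
          for xi in x]

n_missing = sum(v is None for v in x_mnar)
```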

We take the following steps for the simulation of the linear model with MNAR data at different sample sizes:

Step 1: set intercept β_0 = 4 and coefficient β_1 = 5
Step 2: set sample size to n = 30
Step 3: generate X_1 ~ N(5, 4) with sample size n
Step 4: generate ε ~ N(0, 50) with sample size n
Step 5: generate Y with Y = β_0 + β_1 · X_1 + ε
Step 6: take the X observations that are above the 80th percentile and remove 90 per cent of that data
Step 7: estimate confidence intervals using the traditional and bootstrap methods in repeated simulations (1000 times); here we construct 95 percent confidence intervals
Step 8: evaluate how many times (out of 1000) the true parameter was within the estimated OLS and bootstrap confidence intervals
Step 9: repeat Step 2 to Step 8, adding 10 observations to the sample size (n = n + 10); finish when the sample size reaches 200 observations
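The steps above can be condensed into a small Python sketch. It is deliberately scaled down from the paper's setup (300 repetitions instead of 1000, traditional intervals only, a fixed n = 30), so the exact coverage number is illustrative:

```python
import math
import random

random.seed(5)

B0, B1 = 4, 5              # Step 1: true intercept and slope
n = 30                     # Step 2: starting sample size
reps = 300                 # scaled down from the paper's 1000 repetitions
z = 1.96                   # 95% interval

hits = 0
for _ in range(reps):
    x = [random.gauss(5, 4) for _ in range(n)]                   # Step 3
    eps = [random.gauss(0, math.sqrt(50)) for _ in range(n)]     # Step 4
    y = [B0 + B1 * xi + ei for xi, ei in zip(x, eps)]            # Step 5

    # Step 6: drop 90% of the x-values above the 80th percentile (MNAR)
    cutoff = sorted(x)[int(0.8 * n)]
    kept = [(xi, yi) for xi, yi in zip(x, y)
            if not (xi > cutoff and random.random() < 0.9)]
    xs, ys = [p[0] for p in kept], [p[1] for p in kept]

    # Step 7: traditional 95% interval on the incomplete sample
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2
                      for xi, yi in zip(xs, ys)) / (m - 2))
    se = s / math.sqrt(sxx)

    # Step 8: does the interval cover the true slope?
    if b1 - z * se <= B1 <= b1 + z * se:
        hits += 1

coverage = hits / reps
```

The `coverage` value is the share of repetitions in which the interval contains the true β_1, which is the quantity plotted against sample size in the paper's figures.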

Traditional and bootstrap confidence interval estimation was discussed in the sections above. For traditional intervals, we use the following formula, which is estimated in any statistical package when we construct a linear model:

β̂_1 ± t_{α/2} · se(β̂_1)

Bootstrap confidence intervals are built by taking the values at certain percentiles of the parameter distribution generated as a result of bootstrapping.

Results

This part will introduce us with the outcomes of different simulations carried out in R

studio software. One simulation is with correctly specified model with no missing data and

second is with MNAR data. We also take a look at how estimated intervals change as we change

our sample size.

Correctly specified model
In the first place, it is necessary to evaluate how traditional and bootstrap confidence intervals perform when all data is present and we don't have any violation of the regression assumptions. According to theory and many reviewed studies, it is expected that both methods will perform relatively similarly to each other. In other words, for 95 percent confidence intervals, we expect the true parameters to fall within the estimated intervals in at least 95 per cent of cases.

Figure 3 below illustrates how often the true coefficients fall within the estimated confidence intervals built using the traditional and bootstrap methods. We can observe that both approaches are doing pretty well: the constructed intervals contain the true coefficient in at least 95 per cent of the cases across different sample sizes. In other words, the chart clearly shows that


both traditional and bootstrap confidence intervals contain the true parameter in 90-100 percent of the cases, which is the expected outcome.

Figure 3

Bootstrap confidence intervals contain the true coefficients more often than traditional OLS intervals. This is explained by the second graph, which shows that bootstrap intervals are wider than OLS intervals across all sample sizes (see Figure 4).

Figure 4

Data missing not at random
Here we look at the performance of the traditional and bootstrap interval estimations when a large portion of the upper percentiles of the explanatory variable is missing. To remind the reader, we took the data above the 80th percentile of the X variable and removed 90 per cent of it. Afterwards, we estimated confidence intervals using the traditional and bootstrap approaches.



Lastly, we evaluated how often the true coefficient from our simulation fell within the given interval. Ideally, the true coefficient must fall within the interval in 95 per cent of the simulated cases.

The results in Figure 5 indicate that the accuracy of the traditional and bootstrap interval estimates oscillates around 95 per cent, which is our benchmark. This indicates that both approaches are doing pretty well in terms of interval estimates even when quite an important portion of the data is missing. This is a very strong finding in favor of the traditional approach.

This tells us that even when a large share of important data is missing, traditional central limit theorem based interval estimation does a pretty good job.

Figure 5

If we compare the sizes of the confidence intervals in Figure 6, estimated using the traditional and bootstrap methods, one can see that both approaches give a very similar size.

Given that the bootstrap requires a lot of computing power and both approaches show similar results, we can conclude that the traditional approach is still reliable even when a good share of important data is missing not at random.

Figure 6



Conclusion
This study looked into cases when important data is missing not at random and examined two ways of interval estimation for the coefficients of a linear regression. In the first place, we reviewed the related literature on the topic of MNAR data. Based on our investigation, there is limited literature on the application of the bootstrap approach in the case of data missing not at random. Afterwards, we investigated the theoretical side of linear models and the traditional way of building confidence intervals based on the central limit theorem. Along with that, we also looked into the bootstrap approach to constructing confidence intervals. We employed the bootstrapping pairs approach, which does not have any distributional assumptions. In order to evaluate the performance, we need to know the true parameters. For this reason, we carried out a simulation of a simple linear model with one explanatory variable. In order to evaluate the performance of both approaches, we simulated our regression with MNAR data at different sample sizes, spanning from 30 to 200 observations. Simulation results indicate that even when important data is missing not at random, both traditional and bootstrap methods build rather good intervals. In other words, both interval estimates included the true coefficient in around 95 per cent of the cases. In addition, the interval sizes of the traditional and bootstrap confidence intervals are quite similar. This is a rather strong finding in favor of both approaches. Yet, as the bootstrap requires intense computational power while traditional methods are estimated quickly, we conclude that researchers are recommended to still use the traditional method even when a good share of important data is missing not at random.

References:
Carpenter, J. R., & Kenward, M. G. (2012). Missing data in clinical trials: a practical guide. Practical Guides to Biostatistics and Epidemiology. Cambridge University Press.
Chernick, M. R., & LaBudde, R. A. (2014). An introduction to bootstrap methods with applications to R. John Wiley & Sons.
Chernozhukov, V., & Hong, H. (2003). An MCMC approach to classical estimation. Journal of Econometrics, 115(2), 293-346.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge University Press.
DiCiccio, T., & Efron, B. (1992). More accurate confidence intervals in exponential families. Biometrika, 79, 231-245.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1), 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. SIAM, Philadelphia.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals and other measures of statistical accuracy. Statistical Science, 1, 54-77.
Fan, Y., & Li, Q. (2004). A consistent model specification test based on the kernel density estimation. Econometrica, 72(6), 1845-1858.
Flachaire, E. (2007). Bootstrapping heteroscedastic regression models: wild bootstrap vs pairs bootstrap. Computational Statistics and Data Analysis, 49(2), 361-376.
Freedman, D. A. (1981). Bootstrapping regression models. Annals of Statistics, 9, 1218-1228.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10(1), 80-100.
Greene, W. H. (2021). Econometric analysis (8th ed.). Pearson.
Gujarati, D. N., Porter, D. C., & Gunasekar, S. (2012). Basic econometrics. McGraw-Hill Higher Education.
He, Y., & Zaslavsky, A. M. (2012). Diagnostics for multiple imputation in surveys with missing data. Biometrika, 99(4), 731-745.
Horowitz, J. L., & Markatou, M. (1996). Semiparametric estimation of regression models for panel data. Review of Economic Studies, 63(1), 145-168.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An introduction to statistical learning. Publisher.
Lind, D. A., Marchal, W. G., & Wathen, S. A. (1967). Statistical techniques in business and economics (2nd ed.). Publisher.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. Wiley.
Liu, R. Y. (1988). Bootstrap procedures under some non-i.i.d. models. Annals of Statistics, 16, 1696-1708.
Politis, D., & Romano, J. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1312.
Schafer, J. L., & Graham, J. W. (2002). Multiple imputation for missing data: a cautionary tale. Sociological Methods & Research, 31(4), 445-454.
