ECMT 1020: Introduction to Econometrics

Lecture 1

Instructor: Kadir Atalay

Contact: kadir.atalay@sydney.edu.au

School of Economics

The University of Sydney

Contact Information

Unit Coordinator & Instructor W1-W6: Kadir Atalay

o Email: kadir.atalay@sydney.edu.au

o Office: Room 435, Merewether Building (H04)

o Office Hours: Wednesday, 12.30-14.30 or by appointment

Instructor W7-W13: Yi Sun

o Email: yi.sun@sydney.edu.au

o Office: Room 488, Merewether Building (H04)

o Office Hours (tentative): Tuesday, 15.30-17.30

Tutors: See Blackboard

Some Rules

o You should contact me by email.

o Use your USyd email - identify yourself with your name and SID

o Any questions regarding the tutorial program including administrative matters

regarding tutorial allocation should be directed to your tutor

Outline of Lecture

Course Outline

o Textbook

o Assessment

o Tutorials

o Unit Schedule

Analysis of Economic Data

o Types of Data

Univariate Data Summary

o Summary Statistics for Numerical Data

Course Website

We will have a course website on Blackboard:

o http://elearning.sydney.edu.au

Special Announcements: It is essential that you log in at least twice per week to

keep abreast of unit-wide announcements and use the resources to supplement

your learning.

The UoS outline, online quizzes, practice questions, data files, lecture slides and tutorial questions will be posted there.

Lecture slides will be posted, typically about 1 or 2 days before lecture.

Please treat lecture slides as an outline to read before the lecture and fill in the

gaps during or after class.

Textbook

The required text is

o “ ANALYSIS OF ECONOMICS DATA: AN INTRODUCTION TO

ECONOMETRICS” by A. Colin Cameron

This is a draft of a book that will be published in late 2018. This version is tailored for ECMT 1020. We will cover the first 17 chapters (out of 20).

It will be available as a course reader from the University Copy Centre (by the 28th).

o The University Copy Centre is located on the ground floor of the University

of Sydney Sports and Aquatic Centre.

There will be a copy on reserve in the library.

Additional texts for reference – all available in the library:

Wooldridge, J.M., Introductory Econometrics: A Modern Approach, 5th Edition (used in ECMT 2150); Gujarati, D.N., Basic Econometrics, McGraw-Hill.

Assessment

• Your final grade for this unit will be based on six items: four online quizzes, a mid-semester exam, and a final exam. All items are to be completed individually.

ASSESSMENT TASKS AND DUE DATES

Assessment Name      Weight   Due Time            Due Date
Online Quiz 1        5%       noon                21-Aug-2017
Online Quiz 2        5%       20:00               8-Sept-2017
Mid-Semester Exam    30%      18.00 (tentative)   12-Sept-2017
Online Quiz 3        5%       noon                16-Oct-2017
Online Quiz 4        5%       noon                3-Nov-2017
Final Exam           50%      Final Exam Period   Final Exam Period

Mid-Session Examination

o A 75-minute exam will be held during Week 7 (tentatively Tuesday, 12 September 2017, 18.00). The exact time and date will be announced soon.

o Lecture Material for weeks 1-6 will be examined

Final Exam

o The final exam will be cumulative but will place greater emphasis on new topics (we will go over what that means closer to the exam)

Lecture Topics

• First three weeks: univariate data. This is partly a recap of ECMT1010: How can we

summarize and visualize data? What can a sample tell us about the population, and

how can we express our uncertainty about such inference? What changes if we

transform our data? We will particularly focus on those aspects that are relevant for

economic analysis

• Second three weeks: bivariate data. How does one economic variable influence

another one? And again, how certain are we about our inference? We study both the

necessary theory and many economic examples

• Last six weeks: multivariate data. Here, we extend our results to cases where there is

not just one, but several explanatory variables. Finally, we also look at what to do if

the statistical model we estimate is not a good representation of economic reality: How

can we find out? And how much of our results can we salvage?

Tutorials

Tutorials start next week!

o There is one two-hour tutorial session each week, starting next week.

Participation is not mandatory, but is strongly encouraged. Tutorials are a

good opportunity to raise any questions you may have

o Use tutorials to raise questions about the material

o Exercises will be set each week. Do try to solve them!

The answers will be posted later, but before the mid-semester or final exam

In even-numbered weeks, tutorial sessions are held in regular classrooms. These sessions are intended to help you become more familiar with the material covered in class and to provide exam practice.

In odd-numbered weeks, tutorial sessions are held in computer labs. These sessions are intended to apply the material covered in class to real-world economic problems and to teach the basics of the Stata software package, which is widely used in later courses here at uni, as well as in many jobs in the industry.

Computer Labs

o Week 3/5/7/9/11/13

o Computer Exercises

o Use of an econometrics or statistical package:

STATA

STATA

Throughout this unit you will be required to use a computer and specialised

econometric software. (Computer Labs/Tutorials /Online Quizzes)

The statistics and data analysis program STATA will be taught as part of this unit

– and will be regularly demonstrated during the lectures.

This software is available through the Virtual Desktop, so you can use it in any of the ICT Access Labs, Learning Hubs or Libraries (see instructions in the UoS outline). It is also available in Labs 1-5 of the Economics and Business Building (H69).

Some of the learning and access labs are listed below: PNR Learning Hub; Carslaw Learning Hub; Wentworth Learning Hub; Law Access Lab; Madsen Access Lab; Cumberland Access Lab

If you wish to buy your own license to use STATA on your computer:

o http://www.survey-design.com.au/buygradplan.html

(Small Stata will be sufficient for this course)

There is a brief introduction to Stata in Appendix A of the textbook. Generally,

we will just introduce new commands as they are needed. Stata's help facilities are

also pretty good.

Mathematics

•I appreciate many of you haven't had a lot of recent maths practice, and I'll try to

make things smooth.

• Calculus is not needed for this course, although it may help guide your intuitive

understanding of some of the material. Later ECMT courses, as well as higher-division

macro and micro units, will require it though.

• Some familiarity with basic algebra, such as working with summations, is assumed

• If you find that the algebra during the lectures or in the tutorials is moving too fast

for you,

1- please take advantage of the university's Maths Learning Centre. They have free

drop-in classes, including some specifically tailored for economics students. Don't

be ashamed or afraid, they're there to help!

2- LET ME KNOW!!!!! Happy to help you!

Chapter 1‐ Analysis of Economic Data

Use of Economic Data

In a nutshell, econometrics is the use of statistical methods to answer

economic questions

Describing the economic “landscape”

o What is the annual growth rate of GDP ? Has unemployment risen over

past year?

o Do people with higher levels of education earn more?

o Descriptive statistics motivate economic theory

Testing or attempting to distinguish between economic theories

o Is it true that stock returns are unpredictable?

Evaluating government and business policy

o Did those incredibly low interest rates in recent years really help stimulate

the economy?

RECAP ECMT 1010

Types of Data

There are a variety of different types of data that you will encounter in economics. The

ways in which we categorize types of data include the following:

Value: numerical data, categorical data

Unit of observation: cross-section data, time series data, panel data

Number of variables: univariate data, bivariate data, multivariate data

Types of Data / Value / Numerical Data (Quantitative)

Numerical data are data that are naturally recorded and interpreted as numbers. They

can be continuous or discrete. Examples of numerical data include:

Annual income (continuous)

Hours worked (discrete)

Annual GDP (continuous)

Number of times a person has visited dentist (discrete)

Discrete numerical data take only integer values.

Types of Data / Value / Categorical Data

Categorical data are data that are recorded as belonging to one or more groups. They

can be recorded as numbers but these numbers have no inherent meaning. Examples of

categorical data include:

Gender; Religion; Birth Place; …

Types of Data / Units of Observation

Economics data are most often observational data, meaning they are based on

observations of actual behavior in an uncontrolled environment.

Types of Data/ Units of Observation / Cross-section data

Cross-section data are data on different entities collected at a common point in

time.

o Sample of individuals, households, firms, countries, other units taken at a

point in time (“snapshot”).

Notation: $x_i,\ i = 1, \ldots, n$

o i specifies a particular individual for an observation

o n is the total number of individuals observed ( typically called the sample

size)

o x is the value of whatever variable we are observing.

Examples: a single year of census data, unemployment rates by state for a

particular year

Examples of a cross-sectional data set:

Data set on hourly wages of individuals in 2014

observation   hourly wage
1             17.15
2             35.54
3             51.05
.             .
.             .
.             .
498           16.87
499           19.00
500           41.35

$x_i,\ i = 1, \ldots, 500$, so that $x_3 = 51.05$ and $x_{499} = 19.00$

Note that the order of the observations (observation number) is not important.

Types of Data/Units of Observation / Time-series data

Time-series data are data on the same quantity at different points in time.

Notation: $x_t,\ t = 1, \ldots, T$

o t specifies time period of an observation

o T is the total number of time periods

o x is the value of whatever variable we are observing.

Examples: GDP of a country over time, daily averages of the S&P, monthly unemployment rate.

Example: data on minimum wages (Australia, 1950 to 1987)

Year   hourly wage
1950   0.20
1951   0.21
1952   0.23
.      .
.      .
1987   3.35

Types of Data/ Units of Observation / Panel data

Panel data are data on different individuals with each individual observed at multiple

points in time.

Notation: $x_{i,t},\ i = 1, \ldots, n;\ t = 1, \ldots, T$

Panel data is a mixture of cross-section and time series data

Examples: earnings of USyd graduates over time; life expectancy by country over

time

Data set on hourly wages of individuals in 2013-14

observation   person   year   hourly wage
1             1        2013   16.42
2             1        2014   17.15
3             2        2013   37.41
4             2        2014   35.54
.             .        .      .
.             .        .      .
499           250      2013   40.22
500           250      2014   41.35

Types of Data / Number of Variables / Univariate Data

Univariate data is a single data series containing observations of only one variable.

Notation: $x_i$ for cross-section data; $x_t$ for time series data

Examples: Earnings of university graduates in 2012; inflation rate from 1960 to 2014

Types of Data / Number of Variables / Bivariate Data

Bivariate data is composed of two potentially related data series.

Notation: $(x_i, y_i)$ for cross-section data; $(x_t, y_t)$ for time series data

We are often interested in the relationship between x and y.

Examples: Education and earnings of individuals; inflation and unemployment

rates over time.

Types of Data / Number of Variables / Multivariate Data

Multivariate data is composed of three or more potentially related data series.

Notation: $(x_{1,i}, x_{2,i}, \ldots, x_{k,i}, y_i)$ for cross-section data;

$(x_{1,t}, x_{2,t}, \ldots, x_{k,t}, y_t)$ for time series data

We are often interested in how $x_1, \ldots, x_k$ are related to $y$

Examples: Inputs and outputs and profits for a firm over time;

Education, experience, gender and income for a cross-section of individuals.

What do we do with economic data?

The basic steps of data analysis:

1- Data Summary

2- Statistical Inference

3- Interpretation

Steps of Data Analysis: Data Summary

To summarize data, we typically use a combination of visual representations of

the data and statistics

Visual representations include a variety of graphs and charts (scatterplots,

histograms, maps, etc.)

Statistics can measure characteristics of a single variable (mean, median, variance,

etc.) or relationships between multiple variables (covariance, correlation, linear

regression, etc.)

The choice of summary statistics and graphs depends on both the type of data

available and what the researcher is interested in

Steps of Data Analysis: Statistical Inference

The basic idea of statistical inference is to draw conclusions about a relationship

we cannot observe

We typically cannot reach definitive conclusions because we only get to observe a

sample rather than the population

Statistical inference requires using what we know about the sample and about

probability to reach a conclusion about the probable characteristics of variables

and relationships between them at the population level

RECAP - ECMT 1010

Reminder: Statistics (1)

• Statistics is using data to figure out as much as we can about a parameter that we cannot

observe

• A statistical model describes a population that we cannot observe, mainly because observing it would be too much work (the education and salary of every person on Earth) or because the population has infinitely many points (“assume X follows a normal distribution…”)

• This model generally has one or a few parameters, describing the thing we're interested

in: the correlation between education and salary.

• We then assume that our dataset is a sample taken from the population we have

described. From this dataset, we calculate an estimator for the true but unknown

parameter: often something like a sample correlation, or a sample mean

• Standard practice in statistics is to use Greek letters for population quantities $(\mu, \sigma, \rho, \beta)$ and Latin letters for sample quantities $(\bar{x}, s, r, b)$. The textbook largely follows this rule

• Finally, inference happens. Our estimator is probably not exactly equal to the

parameter, but can we say something about how far off it is likely to be? This is where

confidence intervals show up

• More formalities about sampling and inference later in the course, starting next week.

For today, we focus on the sample itself

Focus of this course: Regression Analysis

ECMT1010 focuses on data on a single variable considered in isolation (such as

coin toss)

In this class, we start analyzing univariate data – studying a single data series

(similar to ECMT1010)

Most economic data analysis is focused on measuring the relationship between

two or more variables.

o We want to understand the inter-relationships (and perhaps causality), such as the effect of minimum wage laws on unemployment

o The main statistical method is called “regression analysis”.

Bivariate data (two related series) – Chapters 8 to 12

Multivariate data (three or more related series) – Chapters 13 to 17

Chapter 2 - Univariate Data Summary

Univariate data are a single series of data that are observations on one variable.

A numerical data example is annual earnings for each person in a sample of

women.

A categorical data example is expenditures in each of a number of categories.

Our main focus :

(1) Summary Statistics for Numerical Data

(2) Charts for Numerical Data

Summary Statistics for Univariate Data

Graphs are nice for giving people a quick glimpse of data

However, there is a lot of ambiguity about interpreting graphs and comparing one to

another.

Where is the mean? What is a wide distribution and what is a narrow one? Are tails

big or small? Etc.

Summary statistics give us a standardized way of summarizing univariate data

People know what the numbers mean and they can be compared across different

samples

Types of Summary Statistics

We're often interested in describing the following characteristics of the

distribution of a data series:

o Central tendency – where is the center of the distribution of the data?

What is a typical Australian employee's salary, whatever “typical” means?

o Dispersion –how spread out is the data?

How much inequality is there in our income distribution?

o Skewness (asymmetry) – how symmetric (or asymmetric) is the distribution?

How many millionaires are there, compared to minimum-wage workers?

o Kurtosis (Peakedness) –how fat are the tails, how tall is the peak ?

How rare are minimum-wage workers and millionaires, compared to typical

earners?

A little Math Review:

If X takes n values, $x_1, x_2, \ldots, x_{n-1}, x_n$, their sum is

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n$$

If g(x) is a function of x, then

$$\sum_{i=1}^{n} g(x_i) = g(x_1) + g(x_2) + g(x_3) + \cdots + g(x_n)$$

If “a” and “b” are constant, then

o $\sum_{i=1}^{n} a = n \cdot a$

o $\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$

o $\sum_{i=1}^{n} (a + b x_i) = n a + b \sum_{i=1}^{n} x_i$

o $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$

o $\sum_{i=1}^{n} (x_i \cdot y_i) \neq \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i$
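
The summation rules in this review can be checked numerically. The following is a minimal Python sketch (Python is not part of this unit, and the numbers in x, y, a and b are made up purely for illustration):

```python
# Numerical check of the summation rules on small made-up data.
x = [2.0, 5.0, 7.0]
y = [1.0, 3.0, 4.0]
a, b = 10.0, 3.0
n = len(x)

sum_const = sum(a for _ in range(n))                 # sum of a constant
sum_scaled = sum(a * xi for xi in x)                 # sum of a * x_i
sum_linear = sum(a + b * xi for xi in x)             # sum of (a + b * x_i)
sum_pairs = sum(xi + yi for xi, yi in zip(x, y))     # sum of (x_i + y_i)
sum_products = sum(xi * yi for xi, yi in zip(x, y))  # sum of (x_i * y_i)

print(sum_const == n * a)                # rule 1 holds
print(sum_scaled == a * sum(x))          # rule 2 holds
print(sum_linear == n * a + b * sum(x))  # rule 3 holds
print(sum_pairs == sum(x) + sum(y))      # rule 4 holds
print(sum_products, sum(x) * sum(y))     # rule 5: the two numbers differ
```

The last line shows why the final rule is an inequality: the sum of products and the product of sums are different quantities.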

Types of Summary Statistics – Empirical Example

To go over these different types of summary statistics, we will use the following

example:

This is the distribution of annual earnings of a sample of 171 women who were 30 years of age in 2010. The data are in “EARNINGS.dta” on Blackboard (BB).

[Figure: histogram of Earnings (0 to 200000), with Frequency on the vertical axis.]

Measures of Central Tendency

A measure of central tendency / central location describes the center of the

distribution in the data

Tells us whether center of distribution is

Answers the question, “What is a typical value in this sample?”

Several measures

o Sample mean

o Sample median

o Sample midrange

o Sample mode

The Sample Mean

Most common way to measure central tendency

It is also called the sample average

Definition:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Weights all observations equally!

STATA commands:
mean variable_name
sum variable_name
tabstat variable_name, stat(mean)

The Sample Median

Value that divides the sample into two halves (50% of observations are above the value and 50% are below)

Order the data from lowest to highest value; the median is the value that divides the ordered data into two halves (the one that ends up in the middle).

When n (the number of observations) is odd, the median is the middle value; when n is even, use the average of the two middle observations.

Less sensitive to outliers than the sample average

(An outlying observation, or outlier, is an observation that is unusually large or small)

Other quantiles can be used

STATA commands:
sum variable_name, detail
tabstat variable_name, stat(median)
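
To see how differently the mean and the median react to an outlier, here is a small Python sketch. The function names and the wage figures are hypothetical and chosen only for illustration; in this unit the actual calculations are done in Stata.

```python
def sample_mean(values):
    """Sample mean: every observation gets the same weight 1/n."""
    return sum(values) / len(values)

def sample_median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

wages = [30_000, 40_000, 50_000, 60_000, 70_000]   # made-up annual wages
wages_with_outlier = wages + [1_000_000]           # add one very high earner

print(sample_mean(wages), sample_median(wages))
print(sample_mean(wages_with_outlier), sample_median(wages_with_outlier))
```

Adding the single high earner moves the mean from 50,000 to more than 200,000, while the median only moves from 50,000 to 55,000.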

The Mean Vs. The Median

What is the typical Australian worker's wage?

Among full-time workers in 2011, the average wage was $72,000 per year and the median wage was $57,400 per year.

Note that the mean is over 25% larger than the median.

Why is there such a big difference? Which of these numbers is more relevant?

The Sample Midrange

The sample midrange is the average of the smallest and largest observations.

Not a very commonly used measure

Extremely sensitive to outliers

STATA command: sum variable_name, detail (see 2nd column)

The Sample Mode

The most frequently occurring value in sample

Useful with discrete data and cases where particular values are meaningful (4 years of high school, 40 hours of work each week, ...).

STATA command: tab variable_name
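
The midrange and the mode are simple enough to compute by hand. As a rough Python illustration (the function names and the weekly-hours data are invented for this sketch):

```python
from collections import Counter

def sample_midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2

def sample_mode(values):
    """Most frequently occurring value (the first one found if there is a tie)."""
    counts = Counter(values)
    return counts.most_common(1)[0][0]

hours = [40, 38, 40, 20, 40, 45, 38]   # made-up weekly hours worked

print(sample_midrange(hours))   # depends only on the two extreme values, 20 and 45
print(sample_mode(hours))       # 40 occurs three times
```

Because the midrange uses only the minimum and the maximum, a single unusual observation changes it directly, which is why it is so sensitive to outliers.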

Quartiles , Deciles and Percentiles

Median is the point that equally divides an ordered sample.

Lower Quartile is the point where ¼ (¾) of the sample lies below (above)

Upper Quartile is the point where ¾ (¼) of the sample lies below (above)

STATA command: sum variable_name, detail (see 2nd column)

Finer divisions:

The pth percentile is the value for which p percent of the observed values are equal to or less than the value.

Median: 50th percentile; Upper Quartile: 75th percentile; Lower Quartile: 25th percentile.

Deciles split the ordered sample into tenths.

A quantile is a percentile reported as a fraction of one rather than a percentage (0.56 quantile = 56th percentile).

STATA command: tabstat variable_name, stat(p1 p5 ..)
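
The percentile definition above can be turned into a short calculation. This Python sketch uses the "nearest rank" rule, which is one of several conventions and is an assumption here: Stata and other packages may interpolate between neighbouring observations and so report slightly different values. The data are made up.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest ordered value such that at least p percent of observations are <= it."""
    ordered = sorted(values)
    n = len(ordered)
    rank = max(1, math.ceil(p * n / 100))   # position in the ordered sample (1-based)
    return ordered[rank - 1]

data = [12, 15, 17, 20, 22, 25, 28, 30, 35, 40]   # made-up sample, n = 10

lower_quartile = percentile_nearest_rank(data, 25)
upper_quartile = percentile_nearest_rank(data, 75)
p56 = percentile_nearest_rank(data, 56)            # the 0.56 quantile

print(lower_quartile, upper_quartile, p56)
```

Note that with an even number of observations this rule returns the lower of the two middle values for the 50th percentile, whereas the usual median rule averages them.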

These four measures of central tendency can give very different answers to the

question, what is a typical salary? Which one to use depends on which question you

are trying to answer.

Measures of Dispersion

Characterize the spread or width of the distribution: How far away do observations

tend to be from the mean?

Different measures:

o Sample variance

o Sample standard deviation

o Sample coefficient of variation

o Sample range and inter-quartile range

Like measures of central tendency, the different measures have different benefits

and drawbacks

STATA commands:
sum variable_name, detail (see 2nd column)
tabstat variable_name, stat(…)

Sample Variance

How far away do observations tend to be from the mean?

Simply calculating $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$ is not useful: positive and negative differences cancel out and the result is always zero.

So we work with squared deviations instead. The sample variance is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The division by n - 1 rather than n is a “degrees of freedom” correction, which is necessary because we are using the sample mean $\bar{x}$ rather than the population mean $\mu$.

When we start working with multivariate data, you'll often see n – k popping up

for much the same reason. This is worth remembering: in general,

“degrees of freedom = observations - estimated parameters”
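
Both points on this slide (deviations from the mean sum to zero, and the variance divides by n - 1) can be seen in a few lines of Python. This is only an illustrative sketch with made-up numbers; `sample_variance` is a hypothetical helper name.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

x = [2.0, 4.0, 4.0, 6.0, 9.0]            # made-up data with mean 5
mean_x = sum(x) / len(x)
deviations = [v - mean_x for v in x]      # -3, -1, -1, 1, 4

print(sum(deviations))                    # the raw deviations cancel out
print(sample_variance(x))                 # 28 / (5 - 1)
```

The squared deviations add up to 28; dividing by n - 1 = 4 gives a sample variance of 7, whereas dividing by n = 5 would give 5.6.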

Sample Variance

Approximately equal to the average squared deviation from the mean:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

As the sample variance increases, the spread of the data gets wider

STATA commands:
sum variable_name, detail (see 3rd column)
tabstat variable_name, stat(variance)

One problem with variances is that they're hard to interpret. If x is measured in dollars, $s^2$ is in squared dollars - whatever that means

Sample Standard Deviation

Standard deviation is just the square root of the variance:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Roughly the average deviation of the data from its mean.

It has the same units as the data (not the case for the variance)

If one sample has a larger sample standard deviation than another, then we view that sample as having greater variability.

STATA commands:
sum variable_name (see 3rd column)
tabstat variable_name, stat(sd)
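
The statement about units can be illustrated by measuring the same made-up wages once in thousands of dollars and once in dollars. In this Python sketch (the helper names are hypothetical), the standard deviation scales by 1,000 like the data, while the variance scales by 1,000 squared:

```python
import math

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

wages_thousands = [30.0, 40.0, 50.0, 60.0, 70.0]        # made-up wages in $1,000s
wages_dollars = [1000 * w for w in wages_thousands]     # same wages in dollars

sd_thousands = sample_sd(wages_thousands)
sd_dollars = sample_sd(wages_dollars)
var_thousands = sample_variance(wages_thousands)
var_dollars = sample_variance(wages_dollars)

print(sd_thousands, sd_dollars)      # differ by a factor of 1,000
print(var_thousands, var_dollars)    # differ by a factor of 1,000,000
```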

Interpretation of the Standard Deviation

A useful way to interpret the standard deviation is to use results for the normal

distribution (see ECMT 1010).

For a normal distribution, the probability of being within one and two standard deviations of the mean is 0.68 and 0.95, respectively

For other distributions we know that at least ¾ of a random sample is within two standard deviations of the mean (Chebychev’s inequality)

Recap:

Many things are approximately normally distributed. For normal distributions, we can

interpret the standard deviation as follows:

68% of the observations will be less than one standard deviation away from the

mean

95%, less than two standard deviations

Almost 100%, less than three standard deviations

Even if the distribution is not normal, we still have some bounds.

At least 75% within two sd, at least 88.89% within three sd

In general, at least a fraction $1 - 1/c^2$ within c sd. This result is called Chebychev's inequality (NO NEED TO MEMORIZE)
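
The bound 1 - 1/c² can be compared with what actually happens in a strongly skewed sample. The following Python sketch is only an illustration; the data are made up and the function names are hypothetical.

```python
import math

def chebyshev_bound(c):
    """Minimum fraction of observations within c standard deviations of the mean."""
    return 1 - 1 / c ** 2

def fraction_within(values, c):
    """Observed fraction of the sample lying within c sample sd of the sample mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sum(1 for v in values if abs(v - mean) < c * sd) / n

skewed = [1, 1, 1, 2, 2, 3, 4, 5, 10, 50]   # made-up, clearly non-normal data

for c in (2, 3):
    print(c, chebyshev_bound(c), fraction_within(skewed, c))
```

For c = 2 the bound is 0.75 and for c = 3 it is about 0.8889; the observed fractions in this sample are at least as large, as the inequality guarantees.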

Sample Coefficient of Variation

Sample standard deviation relative to sample mean

$$CV = \frac{s}{\bar{x}}$$

Standardized measure: no units, can be compared across series.

STATA commands:
sum variable_name, detail (use the info in the second and third columns)
tabstat variable_name, stat(cv)
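
Because the coefficient of variation divides the standard deviation by the mean, the units cancel. A short Python sketch with made-up wages (the function name is hypothetical) shows that the result is the same whether the data are in dollars or in thousands of dollars:

```python
import math

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unit-free)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd / mean

wages_thousands = [30.0, 40.0, 50.0, 60.0, 70.0]        # made-up wages in $1,000s
wages_dollars = [1000 * w for w in wages_thousands]     # same wages in dollars

cv_thousands = coefficient_of_variation(wages_thousands)
cv_dollars = coefficient_of_variation(wages_dollars)

print(cv_thousands, cv_dollars)   # the two values agree (about 0.32)
```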

Sample Range

Difference between the largest and smallest values in the sample

Simplest measure of dispersion but also the least interesting

Very sensitive to outliers

STATA commands:
sum variable_name (last two columns)
tabstat variable_name, stat(range)

Sample Inter-Quartile Range

Variation on sample range that is less sensitive to outliers

Equal to the difference between the 75th and 25th percentiles of the distribution

STATA command: tabstat variable_name, stat(iqr)

Average Absolute Deviation

Another measure that is more resistant to outliers

$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
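
The three dispersion measures on this slide can be computed side by side. This Python sketch uses made-up data; the quartiles come from `statistics.quantiles` in the Python standard library, whose default interpolation rule is an assumption here and need not match Stata's.

```python
import statistics

def average_absolute_deviation(values):
    """Mean of the absolute deviations from the sample mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(v - mean) for v in values) / n

x = [2.0, 4.0, 4.0, 6.0, 9.0]                  # made-up data with mean 5

sample_range = max(x) - min(x)                 # largest minus smallest value
q1, _, q3 = statistics.quantiles(x, n=4)       # lower quartile, median, upper quartile
iqr = q3 - q1                                  # inter-quartile range
aad = average_absolute_deviation(x)            # (3 + 1 + 1 + 1 + 4) / 5

print(sample_range, iqr, aad)
```

Only the range depends directly on the two most extreme observations, which is why the inter-quartile range and the average absolute deviation are less sensitive to outliers.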

Symmetry

A distribution is symmetric if its shape is the same when reflected around the

median. A common example is the normal distribution

Measuring Symmetry (or Asymmetry)

Typically use skewness to measure symmetry

Right- skewed: Distribution has a long right tail and data are concentrated to the

left

Left-skewed: Distribution has a long left tail and data are concentrated to the right

Where are the mean and medians?

[Figure: three histograms with Frequency on the vertical axis: x (Symmetric), y (Right-skewed) and z (Left-skewed).]

One way to check for right or left skew is to compare the median to the mean.

Symmetric: $\bar{x} = \text{median}$

Right-skewed: $\bar{x} > \text{median}$

Left-skewed: $\bar{x} < \text{median}$

A formal measure of asymmetry is the skewness statistic:

$$\text{Skew} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}$$

Interpretation of the statistic: symmetric = 0; right-skewed > 0; left-skewed < 0.

STATA command: tabstat variable_name, stat(skewness)
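
The skewness formula can be applied directly to small samples to see the sign pattern. This is a minimal Python sketch with made-up data; as with kurtosis, statistical packages may use slightly different formulae, so treat the function below as one possible implementation.

```python
def skewness(values):
    """Third central moment divided by the second central moment to the power 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 10.0]       # long right tail: mean 3.2, median 2
left_skewed = [-v for v in right_skewed]        # mirror image: long left tail

print(skewness(symmetric))      # zero
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
```

In the right-skewed sample the mean (3.2) lies above the median (2), in line with the mean-versus-median comparison.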

Distribution of arrival delays for United Airlines flights into San Francisco International Airport, January 2014

Mean = 11.39; Median = 0; Skewness = 5.66

[Figure: histogram of Arrival Delay (minutes), with Frequency on the vertical axis.]

Distribution of 500 fastest 100m times as of December 2014

Mean = 9.90; Median = 9.92; Skewness = -1.52

[Figure: histogram of the 100m times x (9.58 to 9.98), with Frequency on the vertical axis.]

Kurtosis

Measures the relative importance of the observations in the tail of the distribution.

(How fat the tails of distribution are.)

Simplest measure is:

$$\text{Kurt} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{2}}$$

Note: different computer programs can use slightly different formulae.

STATA command: tabstat variable_name, stat(kurt)

How to interpret:

The normal distribution, with Kurtosis = 3, is the benchmark.

Excess Kurtosis measures kurtosis relative to the normal distribution

$$\text{ExcessKurt} \cong \text{Kurt} - 3$$

If Excess Kurtosis is equal to 0, the tails are as fat as those of the normal distribution.

Positive Excess Kurtosis: the distribution has fat tails, that is, greater area in the tails than for the normal distribution with the same mean and variance.

Negative Excess Kurtosis: the distribution has skinny tails.
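
As a rough illustration of kurtosis and excess kurtosis, the Python sketch below applies the simple formula to two made-up samples, one evenly spread and one with a few extreme values. The function names are hypothetical, and other software may use slightly different formulae.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Kurtosis relative to the normal benchmark of 3."""
    return kurtosis(values) - 3

evenly_spread = [1.0, 2.0, 3.0, 4.0, 5.0]
fat_tailed = [-10.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 10.0]

print(kurtosis(evenly_spread), excess_kurtosis(evenly_spread))   # below the benchmark
print(kurtosis(fat_tailed), excess_kurtosis(fat_tailed))         # above the benchmark
```

The evenly spread sample has kurtosis 1.7 (negative excess kurtosis, skinny tails), while the sample with two extreme values has kurtosis above 3 (positive excess kurtosis, fat tails).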

How to present key summary statistics for the data?

Tables (see for example, Table 2.1 in your book)

Annual earnings of 30-year-old female full-time workers in 2010

Box and Whisker Plots (Box Plot)

All box-and-whisker plots give the lower quartile, median and upper quartile;

these form the “box.”

Simple box-and-whisker plots additionally give the minimum and maximum;

these form the “whiskers.”

More complicated box-and-whisker plots additionally plot outlying values.

In more complicated box-and-whisker plots, the whiskers are data-determined lower and upper bounds. Outlying observations are the values that fall outside these bounds.

This is a complicated form of box plot.

In this case, the upper bar equals the upper quartile + 1.5 times the inter-quartile range.

The six dots are the outliers.

The lower bar is the minimum sample value.

There are no outliers below the lower bound: no values are lower than 25k - 1.5*(50k - 25k) = -12.5k.

The data are right-skewed.
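
The whisker bounds quoted for this box plot follow from the quartiles (25k and 50k). A small Python sketch of the arithmetic is shown below; the function name and the list of earnings used to flag outliers are hypothetical.

```python
def whisker_bounds(lower_quartile, upper_quartile, multiple=1.5):
    """Lower and upper bounds: quartiles minus/plus 1.5 times the inter-quartile range."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - multiple * iqr, upper_quartile + multiple * iqr

lower_bound, upper_bound = whisker_bounds(25_000, 50_000)

# Made-up earnings values, used only to show how outliers are flagged.
earnings = [1_050, 20_000, 36_000, 50_000, 90_000, 172_000]
outliers = [e for e in earnings if e < lower_bound or e > upper_bound]

print(lower_bound, upper_bound)   # -12,500 and 87,500
print(outliers)                   # values beyond the bounds
```

Since earnings cannot fall below -12.5k, no low outliers are possible here; all outliers lie above the upper bound.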

Graphical Representations of Univariate Data

With univariate data, we have a few different options for graphing the data. The most

common are:

Histograms - graphs showing the frequency of occurrence of different values

Line charts - plots of the variable value against the observation number

Pie charts, bar charts, column charts - various ways to present observations

that are measured in different categories

A Histogram example using absolute frequencies

o Absolute frequency is just the number of times a particular value is observed in the data. It can be problematic if n is large (the “y” axis becomes hard to read).

STATA command: histogram variable_name, frequency

[Figure: histogram of earnings (0 to 200000), with Frequency on the vertical axis.]

A Histogram example using relative frequencies

o Relative frequency - the number of times a value is observed as a percentage of all

observations

STATA command: histogram variable_name, percent

[Figure: histogram of earnings (0 to 200000), with Percent on the vertical axis.]

Histograms

There are a few choices to make when constructing a histogram.

Whether to use absolute frequency or relative frequency for the vertical axis

o Absolute frequency - just the number of times a particular value is observed

in the data

o Relative frequency - the number of times a value is observed as a percentage

of all observations

o Either choice will lead to the same shape for the histogram

How large to make the bin sizes

o If the data take on many different values, you'll want to group data into bins

o In general, the more observations you have, the more bins you use.

o A common default choice is $\sqrt{n}$

Histograms

Number of bins:

Too few bins: not enough information. Too many bins: hard to read.

Rule of thumb is $\sqrt{n}$; in our example, $\sqrt{171} \approx 13$ bins.

o The width of each bin is (172,000 - 1,050)/13 = 13,150.
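
The bin arithmetic for the earnings example (n = 171, minimum 1,050, maximum 172,000) can be reproduced in a few lines of Python. The helper `bin_index` is a hypothetical function added only to show how an observation would be assigned to a bin.

```python
import math

n = 171
min_earnings, max_earnings = 1_050, 172_000

num_bins = round(math.sqrt(n))                        # square-root rule of thumb
bin_width = (max_earnings - min_earnings) / num_bins  # equal-width bins

def bin_index(value):
    """0-based index of the bin containing value; the maximum falls in the last bin."""
    return min(int((value - min_earnings) // bin_width), num_bins - 1)

print(num_bins, bin_width)
print(bin_index(min_earnings), bin_index(max_earnings))
```

The square root of 171 is about 13.08, so the rule of thumb gives 13 bins, each 13,150 wide.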

Stem and Leaf display (A variation of Histogram)

Smoothed Histograms

Data that take many different values, such as earnings data, have an underlying continuous probability density function rather than a discrete probability mass function. (We will talk more about these in the coming weeks.)

Data of this form are better presented by a smooth graph than by discrete bins.

A smoothed histogram smooths the histogram in two ways.

o First, it uses rolling bins (or windows) that are overlapping rather than distinct.

o Second, in counting the fraction of the sample within each bin it gives more

weight to observations that are closest to the center of the window and less to

those near the ends of the window.

A well-known example is a kernel density estimate. (The choice of window width plays a role similar to the choice of bin size.)
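
The two smoothing ideas above (overlapping windows, and more weight near the centre of the window) can be sketched in Python with an Epanechnikov kernel. This is a simplified illustration under stated assumptions: the data are made up, the window width (bandwidth) is chosen by hand, and the way Stata's kdensity picks its default bandwidth is not reproduced.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_density(x, data, bandwidth):
    """Density estimate at x: weighted share of observations within one bandwidth of x."""
    n = len(data)
    return sum(epanechnikov((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

earnings_k = [20.0, 25.0, 30.0, 32.0, 35.0, 40.0, 60.0, 90.0]   # made-up, in $1,000s

step = 0.1
grid = [i * step for i in range(1201)]                          # 0 to 120

narrow = [kernel_density(x, earnings_k, 10.0) for x in grid]    # narrower window
wide = [kernel_density(x, earnings_k, 20.0) for x in grid]      # wider window

area_narrow = sum(narrow) * step    # a density should integrate to about 1
area_wide = sum(wide) * step

print(area_narrow, area_wide)
print(max(narrow), max(wide))       # peak height of each estimate
```

Comparing the two estimates on the same grid shows the effect of the window width: a wider window spreads each observation's weight over a larger range of x.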

Kernel Density – Example of Earnings data

STATA commands:
kdensity earnings (default window width; kernel = epanechnikov, bandwidth = 5.0e+03)
kdensity earnings, bwidth(10000) (wider window width; kernel = epanechnikov, bandwidth = 1.0e+04)

[Figure: two kernel density plots of earnings (0 to 200000), with Density on the vertical axis: default window width and wider window width.]

Line Charts

When the observations in a univariate dataset have a natural order, it often makes sense

to use a line chart

A line chart plots successive values of the data against the successive index values

This offers an easy way to visualize whether values are getting larger or smaller

Line charts are most common with time series data

STATA command tsline variable_name

Data is real GSP per capita in US

Categorical Data – Pie and Bar Charts

Histograms are good for representing numerical univariate data. For categorical

univariate data, we typically use pie charts or bar/column charts.

o Pie charts are perhaps the easiest way for people to visualize percentages

o Bar/column charts have the advantage of being able to show both relative and

absolute frequencies

o Bar/column charts will become more useful as we start adding more variables

For more on STATA graph commands:

(a) use drop-down “graphics” menu on the top-right corner

(b) type help graph in command window.

(c) Or just google…

Some other examples of Visual Presentation of Data

Google Trends data (http://www.google.com.au/trends/) for the word “ cricket ” (blue

line) and the word “ football ”(red line) – only for Australia.

Some other examples of Visual Presentation of Data

Wordle generated from Obama's 2009 State of the Union address (after start of

recession)
