1 quine.txt

  1. Description: Absenteeism among school children in a rural area of New South Wales, Australia, recorded together with ethnic background, sex, age group, and learner status.

  2. Variables

    Variable (column name)   Description         Values
    Eth                      Ethnic background   A: Aboriginal, N: non-Aboriginal
    Sex                      Sex                 F: female, M: male
    Age                      Age group           F0, F1, F2, F3
    Lrn                      Learner status      AL: average learner, SL: slow learner
    Days                     Days absent
  3. Source: S. Quine, quoted in Aitkin, M. (1978) The analysis of unbalanced cross classifications (with discussion). Journal of the Royal Statistical Society series A 141, 195–223.
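
The variables above can be explored with a short sketch like the following. It assumes that quine.txt has the same content as the `quine` data frame shipped with the MASS package; if the text file differs, read it with `read.table()` instead.

```r
library(MASS)  # assumption: quine.txt matches MASS::quine

data(quine)
str(quine)

# Mean number of days absent for each combination of
# ethnic background (Eth) and learner status (Lrn)
mean_days <- aggregate(Days ~ Eth + Lrn, data = quine, FUN = mean)
mean_days
```

Swapping `Lrn` for `Sex` or `Age` in the formula gives the corresponding breakdown for the other grouping variables.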

2 mtcars

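  1. Description: Data extracted from the 1974 Motor Trend US magazine, covering fuel economy, design, and performance measures for 32 car models.

  2. Variables

    Variable (column name)   Description                 Values / units
    mpg                      Fuel economy                miles per gallon
    cyl                      Number of cylinders
    disp                     Displacement
    hp                       Gross horsepower
    drat                     Rear axle ratio
    wt                       Weight                      1000 lbs
    qsec                     1/4 mile time
    vs                       Engine shape                0: V-shaped, 1: straight
    am                       Transmission                0: automatic, 1: manual
    gear                     Number of forward gears
    carb                     Number of carburetors
  3. Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.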
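
Because `vs` and `am` are stored as 0/1 codes, it can help to label them before summarising. The sketch below uses the `mtcars` data frame built into R; the object name `cars` and the label strings are illustrative choices, with the 0/1 meanings taken from the R help page for `mtcars`.

```r
data(mtcars)

# Recode the 0/1 indicators as labelled factors
cars <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
  vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)

# Average fuel economy (mpg) by transmission type
mpg_by_am <- tapply(cars$mpg, cars$am, mean)
mpg_by_am
```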

3 고용률.xlsx

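  1. Description: The employment rate is the share of employed persons in the population aged 15 and over.

  2. Variables

    Variable (column name)   Description               Values
    연도                     Year                      1980-2015
    전체                     Employment rate, total
    남자                     Employment rate, men
    여자                     Employment rate, women
  3. Source: Statistics Korea (통계청), Economically Active Population Survey (「경제활동인구조사」), each year.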
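
The definition in item 1 is a simple ratio. A minimal sketch follows; the function name `employment_rate` and the counts are hypothetical and serve only to illustrate the arithmetic, they are not taken from the data file.

```r
# Employment rate (%) = employed persons / population aged 15 and over * 100
employment_rate <- function(employed, population_15plus) {
  stopifnot(all(population_15plus > 0), all(employed <= population_15plus))
  100 * employed / population_15plus
}

# Hypothetical counts (in thousands), for illustration only
rate <- employment_rate(employed = 600, population_15plus = 1000)
rate
```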

4 대졸자 실업률.csv

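  1. Description: Annual unemployment rate of four-year university graduates.

  2. Variables

    Variable (column name)   Description                      Values
    year                     Year                             2000-2016
    x24                      Unemployment rate, ages 20-24
    x29                      Unemployment rate, ages 25-29
  3. Source: Statistics Korea (통계청), Economically Active Population Survey (「경제활동인구조사」), each year.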

5 Bone Mineral Density Data

  1. Description: Measurements of spinal bone mineral density (the density of minerals in the spine) in 261 North American adolescents, as a function of age. Each value is the difference in spnbmd between two consecutive visits, divided by the average. The age is the average age over the two visits.

  2. Objectives
    1. ….
    2. ….
  3. Source: the bone dataset in the ElemStatLearn package

  4. Variables - A data frame with 485 observations on the following 4 variables.
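    1. idnum: adolescent identifier
    2. age: average age over the two visits
    3. gender: sex (female, male)
    4. spnbmd: relative spinal bone mineral density (difference between two consecutive measurements, divided by the average)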
  5. Data preview

    library(ElemStatLearn)
    data("bone")
    knitr::kable(head(bone))
    idnum age gender spnbmd
    1 11.70 male 0.0180807
    1 12.70 male 0.0601093
    1 13.75 male 0.0058575
    2 13.25 male 0.0102639
    2 14.30 male 0.2105263
    2 15.30 male 0.0408432
  6. References
    • Bachrach LK, Hastie T, Wang M-C, Narasimhan B, Marcus R. Bone Mineral Acquisition in Healthy Asian, Hispanic, Black and Caucasian Youth. A Longitudinal Study. J Clin Endocrinol Metab (1999) 84, 4702-12.
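
The preview shows several rows per `idnum`, which is how 261 adolescents give 485 observations. The sketch below rebuilds the six preview rows by hand (values copied from the output above, not loaded from the package) and counts the records per adolescent; `bone_head` is an illustrative name.

```r
# First six rows of `bone`, copied from the data preview
bone_head <- data.frame(
  idnum  = c(1, 1, 1, 2, 2, 2),
  age    = c(11.70, 12.70, 13.75, 13.25, 14.30, 15.30),
  gender = rep("male", 6),
  spnbmd = c(0.0180807, 0.0601093, 0.0058575, 0.0102639, 0.2105263, 0.0408432),
  stringsAsFactors = FALSE
)

# Number of records per adolescent: several rows share one idnum
visits_per_id <- table(bone_head$idnum)
visits_per_id

# Mean relative change in spinal bone mineral density per adolescent
mean_by_id <- aggregate(spnbmd ~ idnum, data = bone_head, FUN = mean)
mean_by_id
```

With the package installed, the same two calls can be applied to the full `bone` data frame.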

6 Market Basket Analysis

  1. Description: An extract of survey data covering 8993 shopping mall customers in San Francisco.
  2. Objectives: The goal is to predict the annual household income from the other 13 demographic attributes.

  3. Variables
    1. Income: ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
        1. <10,000; 2. 10,000~14,999; 3. 15,000~19,999; 4. 20,000~24,999; 5. 25,000~29,999;
        6. 30,000~39,999; 7. 40,000~49,999; 8. 50,000~74,999; 9. >=75,000
    2. Sex: 1. Male 2. Female

    3. Marital: 1. Married 2. Living together, not married 3. Divorced or separated 4. Widowed 5. Single, never married

    4. Age: 1. 14 thru 17 2. 18 thru 24 3. 25 thru 34 4. 35 thru 44 5. 45 thru 54 6. 55 thru 64 7. 65 and Over

    5. Edu: 1. Grade 8 or less 2. Grades 9 to 11 3. Graduated high school 4. 1 to 3 years of college 5. College graduate 6. Grad Study

    6. Occupation
        1. Professional/Managerial 2. Sales Worker 3. Factory Worker/Laborer/Driver 4. Clerical/Service Worker
        5. Homemaker 6. Student, HS or College 7. Military 8. Retired 9. Unemployed
    7. Lived: HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
        1. < 1 year; 2. 1~3; 3. 4~6; 4. 7~10; 5. >10 years
    8. Dual_Income: DUAL INCOMES (IF MARRIED) 1. Not Married 2. Yes 3. No

    9. Household: PERSONS IN YOUR HOUSEHOLD
        1. One 2. Two 3. Three 4. Four 5. Five 6. Six 7. Seven 8. Eight 9. Nine or more
    10. Householdu18: PERSONS IN HOUSEHOLD UNDER 18
        0. None 1. One 2. Two 3. Three 4. Four 5. Five 6. Six 7. Seven 8. Eight 9. Nine or more
    11. Status: HOUSEHOLDER STATUS
        1. Own 2. Rent 3. Live with Parents/Family
    12. Home_Type: 1. House 2. Condominium 3. Apartment 4. Mobile Home 5. Other

    13. Ethnic
        1. American Indian 2. Asian 3. Black 4. East Indian 5. Hispanic 6. Pacific Islander 7. White 8. Other
    14. Language: WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
        1. English 2. Spanish 3. Other
  4. Data preview

    knitr::kable(head(marketing))
    Income Sex Marital Age Edu Occupation Lived Dual_Income Household Householdu18 Status Home_Type Ethnic Language
    9 2 1 5 4 5 5 3 3 0 1 1 7 NA
    9 1 1 5 5 5 5 3 5 2 1 1 7 1
    9 2 1 3 5 1 5 2 3 1 2 3 7 1
    1 2 5 1 2 6 5 1 4 2 3 1 7 1
    1 2 5 1 2 6 3 1 4 2 3 1 7 1
    8 1 1 6 4 8 5 3 2 0 1 1 7 1
  5. References
    • Impact Resources, Inc., Columbus, OH (1987). A total of N=9409 questionnaires containing 502 questions were filled out by shopping mall customers in the San Francisco Bay area.
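
All variables are stored as integer codes, so a model or table is easier to read once the codes are mapped to their categories. The sketch below decodes `Income` for the six preview rows (codes copied from the output above); the label strings follow the Income item in the variable list and the object names are illustrative.

```r
income_labels <- c("<10,000", "10,000~14,999", "15,000~19,999", "20,000~24,999",
                   "25,000~29,999", "30,000~39,999", "40,000~49,999",
                   "50,000~74,999", ">=75,000")

# Income codes copied from the six preview rows
income_code <- c(9, 9, 9, 1, 1, 8)

# Ordered factor, so that comparisons such as income[4] < income[6] are meaningful
income <- factor(income_code, levels = 1:9, labels = income_labels, ordered = TRUE)
table(income)
```

The same pattern applies to the other coded columns, using an unordered factor where the categories have no natural order (for example `Occupation`).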

7 Handwritten Digit Recognition Data

  1. Description:
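    • Digits scanned as images from ZIP codes handwritten on envelopes handled by the U.S. Postal Service, stored as 16*16 grayscale images (Le Cun et al., 1990)
    • The data are used as a benchmark for machine learning and artificial neural network models
  2. Data structure:
    • The first column is the digit as identified by a human; the remaining 256 columns are the 16*16 image flattened into a vector (each value lies between -1 and 1)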
  3. Data preview

    zip.train[1:5, 1:10]
    ##      [,1] [,2] [,3] [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
    ## [1,]    6   -1   -1   -1 -1.000 -1.000 -1.000 -1.000 -0.631  0.862
    ## [2,]    5   -1   -1   -1 -0.813 -0.671 -0.809 -0.887 -0.671 -0.853
    ## [3,]    4   -1   -1   -1 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
    ## [4,]    7   -1   -1   -1 -1.000 -1.000 -0.273  0.684  0.960  0.450
    ## [5,]    3   -1   -1   -1 -1.000 -1.000 -0.928 -0.204  0.751  0.466
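    # Build 4 rows of 10 digits each from the first 40 training images
    im <- lapply(1:4, function(r){
      o <- lapply((r-1)*10 + c(1:10), function(i) {
        # Drop the label (column 1) and reshape the 256 pixels to 16 x 16
        im <- matrix(zip.train[i,-1], 16, 16)
        # Reverse the columns so the digit is upright when drawn by image()
        im[, 16:1]
      })
      do.call(rbind, o)
    })
    # Stack the rows in reverse order so the first ten digits appear at the top
    im <- do.call(cbind, im[length(im):1])
    image(im, col=gray(256:0/256), zlim=c(0,1), xlab="", ylab="", xaxt="n", yaxt="n", bty="n")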
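
The reshaping step used in the plot can be isolated as a small helper. The sketch below is an illustration only: `row_to_image` is a hypothetical name, and the input is a synthetic row rather than a row of `zip.train`, so it runs without the ElemStatLearn package.

```r
# Split one data row (label followed by 256 pixels) into its parts
row_to_image <- function(row) {
  stopifnot(length(row) == 257)
  list(
    label  = row[1],
    # matrix() fills column by column; reversing the columns gives the
    # orientation that image() draws upright
    pixels = matrix(row[-1], 16, 16)[, 16:1]
  )
}

# Synthetic row: label 3, first pixel -1, last pixel 1, all others 0
fake_row <- c(3, -1, rep(0, 254), 1)
digit <- row_to_image(fake_row)
dim(digit$pixels)
```

With the real data, `row_to_image(zip.train[1, ])$pixels` could be passed to `image()` to draw a single digit.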

8 Movies Data

  1. Description: These data were obtained from IMDB and are provided in the R package ggplot2movies

    # install.packages("ggplot2movies")
    library(ggplot2movies)
  2. A data frame with 58788 rows and 24 variables

    • title: movie title
    • year: year of release
    • budget: production budget
    • length: running time (minutes)
    • rating: average rating (IMDB user rating)
    • votes: number of users who rated the movie
    • r1-10: approximate percentage (to the nearest 10%) of users who gave the movie each rating from 1 to 10
    • mpaa: MPAA rating
    • Action, Animation, Comedy, Drama, Documentary, Romance, Short: genre indicators (binary)
  3. Examples

    table(movies$Animation)
    ## 
    ##     0     1 
    ## 55098  3690
    movies_ani <- subset(movies, votes>50 & rating>=8)
    dim(movies_ani)
    ## [1] 1012   24
    head(movies_ani)
    ##                          title year length   budget rating votes  r1  r2
    ## 114 10 from Your Show of Shows 1973     92       NA    8.6    61 4.5 0.0
    ## 128    100 Years at the Movies 1994      9       NA    9.2    91 0.0 4.5
    ## 156               12 Angry Men 1957     96   340000    8.7 29278 4.5 4.5
    ## 163                  12 stulev 1971    161       NA    8.9   252 4.5 0.0
    ## 282      2001: A Space Odyssey 1968    156 10500000    8.3 64982 4.5 4.5
    ## 297                   21 Grams 2003    124 20000000    8.0 21857 4.5 4.5
    ##      r3  r4  r5  r6   r7   r8   r9  r10 mpaa Action Animation Comedy Drama
    ## 114 4.5 4.5 0.0 4.5  4.5 14.5 14.5 44.5           0         0      1     0
    ## 128 0.0 4.5 4.5 4.5  4.5  4.5 14.5 64.5           0         0      0     0
    ## 156 4.5 4.5 4.5 4.5  4.5 24.5 24.5 34.5           0         0      0     1
    ## 163 4.5 0.0 4.5 4.5  4.5  4.5 14.5 64.5           0         0      1     0
    ## 282 4.5 4.5 4.5 4.5  4.5 14.5 14.5 34.5           0         0      0     0
    ## 297 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5    R      0         0      0     1
    ##     Documentary Romance Short
    ## 114           0       0     0
    ## 128           1       0     1
    ## 156           0       0     0
    ## 163           0       0     0
    ## 282           0       0     0
    ## 297           0       0     0
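    # Count movies per release year (only years present in movies_ani appear)
    ani <- table(movies_ani$year)
    # ts() treats the counts as consecutive years starting at the first year
    ts.plot(ts(ani, start=as.integer(names(ani)[1])), main="Number of movies rated 8 or higher", ylab="count", xlab="year")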
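
The plot passes per-year counts to `ts()`, which assumes one value per consecutive year. If some year has no qualifying movie, `table()` omits it and the later counts are drawn at the wrong year. A minimal sketch of one way to guard against this follows; it uses only the six release years visible in the `head(movies_ani)` output as stand-in data, and the object names are illustrative.

```r
# Release years copied from the head(movies_ani) preview; stand-in data only
years <- c(1973, 1994, 1957, 1971, 1968, 2003)

# Make year a factor spanning every year in the range, so that table()
# reports 0 for years without any movie instead of dropping them
all_years <- min(years):max(years)
counts <- table(factor(years, levels = all_years))

# One value per consecutive year, as ts() expects
counts_ts <- ts(as.integer(counts), start = min(years))
length(counts_ts)
```

Replacing `years` with `movies_ani$year` applies the same idea to the full subset.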

9 Wage data

  1. Description: Wage and other data for a group of 3000 male workers in the Mid-Atlantic region, from the R package ISLR

    # install.packages("ISLR")
    library(ISLR)
  2. A data frame with 3000 observations on the following 11 variables.
    • year: Year that wage information was recorded
    • age: Age of worker
    • maritl: A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status
    • race: A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race
    • education: A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level
    • region: Region of the country (mid-atlantic only)
    • jobclass: A factor with levels 1. Industrial and 2. Information indicating type of job
    • health: A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker
    • health_ins: A factor with levels 1. Yes and 2. No indicating whether worker has health insurance
    • logwage: Log of workers wage
    • wage: Workers raw wage
  3. Examples

    head(Wage, 3)
    ##        year age           maritl     race       education
    ## 231655 2006  18 1. Never Married 1. White    1. < HS Grad
    ## 86582  2004  24 1. Never Married 1. White 4. College Grad
    ## 161300 2003  45       2. Married 1. White 3. Some College
    ##                    region       jobclass         health health_ins
    ## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
    ## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
    ## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
    ##         logwage      wage
    ## 231655 4.318063  75.04315
    ## 86582  4.255273  70.47602
    ## 161300 4.875061 130.98218
    library(ggplot2)
    ggplot(data = Wage, mapping=aes(x = age, y = wage)) + geom_point()

    ggplot(Wage, aes(x = education, y = wage)) +
        geom_boxplot()
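
The preview suggests that `logwage` is the natural logarithm of `wage`. A quick check using the three rows printed by `head(Wage, 3)` (values copied from that output; the tolerance of 1e-5 is an assumption that allows for rounding in the printed digits):

```r
# Values copied from the head(Wage, 3) output
wage    <- c(75.04315, 70.47602, 130.98218)
logwage <- c(4.318063, 4.255273, 4.875061)

# Largest absolute gap between log(wage) and the printed logwage
max_gap <- max(abs(log(wage) - logwage))
max_gap < 1e-5
```

With ISLR loaded, `all.equal(log(Wage$wage), Wage$logwage)` would run the same comparison on all 3000 rows.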

10 Seoul subway station information

  1. Variable descriptions: sheet 2 of stndata.xlsx

    library(readxl)
    stndata_info <- read_excel("stndata.xlsx", sheet=2)
    stndata_info
    ## # A tibble: 20 x 2
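    ##    변수명        변수설명 (English gloss added)
    ##    <chr>         <chr>
    ##  1 LINE_NM       라인넘버 (line number)
    ##  2 cd_Do         시도코드 (province/city code)
    ##  3 cd_si_gu      시군구코드 (city/county/district code)
    ##  4 cd_dong       읍면동코드 (town/township/neighborhood code)
    ##  5 latitude      위도 (latitude)
    ##  6 longitude     경도 (longitude)
    ##  7 address       주소 (address)
    ##  8 total_ppln    총인구수 (total population)
    ##  9 male_ppln     남자인구수 (male population)
    ## 10 female_ppln   여자인구수 (female population)
    ## 11 no_hholds     세대수 (number of households)
    ## 12 no_businesses 사업체수 (number of businesses)
    ## 13 no_workers    종사자수 (number of workers)
    ## 14 male_wrks     남자종사자수 (number of male workers)
    ## 15 female_wrks   여자종사자수 (number of female workers)
    ## 16 m_age0        남자0-4세인구수 (male population aged 0-4)
    ## 17 m_age1        남자5-9세인구수 (male population aged 5-9)
    ## 18 m_age20       남자100세이상인구수 (male population aged 100 and over)
    ## 19 f_age0        여자0-4세인구수 (female population aged 0-4)
    ## 20 f_age20       여자100세이상인구수 (female population aged 100 and over)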
  2. Data: sheet 1 of stndata.xlsx

    stndata <- read_excel("stndata.xlsx", sheet=1)
    head(stndata)
    ## # A tibble: 6 x 60
    ##    X__1 STN_ID STN_NM LINE_NM cd_Do cd_si_gu cd_dong latitude longitude
    ##   <dbl>  <dbl> <chr>  <chr>   <dbl>    <dbl>   <dbl>    <dbl>     <dbl>
    ## 1     1    150 서울역 1호선      11    11020 1102054     37.6      127.
    ## 2     2    151 시청역 1호선      11    11020 1102052     37.6      127.
    ## 3     3    152 종각역 1호선      11    11010 1101061     37.6      127.
    ## 4     4    153 종로3가역… 1호선      11    11010 1101061     37.6      127.
    ## 5     5    154 종로5가역… 1호선      11    11010 1101063     37.6      127.
    ## 6     6    155 동대문역… 1호선      11    11010 1101068     37.6      127.
    ## # ... with 51 more variables: address <chr>, total_ppln <dbl>,
    ## #   male_ppln <dbl>, female_ppln <dbl>, no_hholds <dbl>,
    ## #   no_businesses <dbl>, no_workers <dbl>, male_wrks <dbl>,
    ## #   female_wrks <dbl>, m_age0 <dbl>, m_age1 <dbl>, m_age2 <dbl>,
    ## #   m_age3 <dbl>, m_age4 <dbl>, m_age5 <dbl>, m_age6 <dbl>, m_age7 <dbl>,
    ## #   m_age8 <dbl>, m_age9 <dbl>, m_age10 <dbl>, m_age11 <dbl>,
    ## #   m_age12 <dbl>, m_age13 <dbl>, m_age14 <dbl>, m_age15 <dbl>,
    ## #   m_age16 <dbl>, m_age17 <dbl>, m_age18 <dbl>, m_age19 <dbl>,
    ## #   m_age20 <dbl>, f_age0 <dbl>, f_age1 <dbl>, f_age2 <dbl>, f_age3 <dbl>,
    ## #   f_age4 <dbl>, f_age5 <dbl>, f_age6 <dbl>, f_age7 <dbl>, f_age8 <dbl>,
    ## #   f_age9 <dbl>, f_age10 <dbl>, f_age11 <dbl>, f_age12 <dbl>,
    ## #   f_age13 <dbl>, f_age14 <dbl>, f_age15 <dbl>, f_age16 <dbl>,
    ## #   f_age17 <dbl>, f_age18 <dbl>, f_age19 <dbl>, f_age20 <dbl>
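
The description sheet lists only the first and last age columns, while the data contain `m_age0` to `m_age20` and `f_age0` to `f_age20`. The sketch below builds a lookup of all 21 age bands. It assumes that the intermediate columns continue the five-year pattern shown for `m_age0` (0-4) and `m_age1` (5-9), which the sheet does not state explicitly; `age_cols` is an illustrative name.

```r
k <- 0:20

# Assumed pattern: group k covers ages 5k to 5k+4; the last group is 100 and over
age_label <- ifelse(k < 20, paste0(5 * k, "-", 5 * k + 4), "100+")

age_cols <- data.frame(
  male   = paste0("m_age", k),
  female = paste0("f_age", k),
  ages   = age_label,
  stringsAsFactors = FALSE
)
head(age_cols, 3)
```

Such a table can be used to relabel the age columns before plotting, for example when drawing a population pyramid around a station.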