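Fetching data from GitHub using Python in a Jupyter notebook

I am working through Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn and TensorFlow".

I am new to Jupyter and Python.

I am trying to follow the code below.

My problem occurs when I run a cell containing the following code: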

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

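The cell never finishes evaluating, so it never shows as completed.

I assumed the problem was the initial URL, because visiting it in a web browser produced an error.

So I changed the URL.

That part now seems fine. However, when I run the function, I get the following: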

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-6-bd66b1fe6daf> in <module>
----> 1 fetch_housing_data()

<ipython-input-5-ef3c39b342d8> in fetch_housing_data(housing_url, housing_path)
      9     tgz_path = os.path.join(housing_path, "housing.tgz")
     10     urllib.request.urlretrieve(housing_url, tgz_path)
---> 11     housing_tgz = tarfile.open(tgz_path)
     12     housing_tgz.extractall(path=housing_path)
     13     housing_tgz.close()

~\Anaconda3\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1576                         fileobj.seek(saved_pos)
   1577                     continue
-> 1578             raise ReadError("file could not be opened successfully")
   1579 
   1580         elif ":" in mode:

ReadError: file could not be opened successfully

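Why does this happen, and how can I fix it?

Have you tried restarting the kernel and running it again? I cannot reproduce what you are seeing.

The first code block you pasted above works as written. There is no need to modify it. I ran the code below, and it worked when I then called the function from another cell: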

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

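Are you sure that not seeing the cell finish is more than just an artifact? To check independently, you can run the code somewhere else, as I did. I simply went to the page and clicked the link at the bottom to test it. Then I pasted your code into the cell that came up. After running those two cells, I have a directory with the contents in it.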
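One way to tell a stalled download from a cell that is simply still running is to download with an explicit timeout, so that a connection that stops responding raises an error instead of blocking indefinitely. A minimal sketch; `download_with_timeout` and the 30-second default are illustrative choices of mine, not something from the book:

```python
import shutil
import urllib.request


def download_with_timeout(url, dest_path, timeout=30):
    """Download `url` to `dest_path`, failing if the connection stalls.

    Hypothetical helper; `timeout` is in seconds and applies to blocking
    socket operations such as connecting and each read.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Stream the response body to disk in chunks.
            shutil.copyfileobj(response, out_file)
    return dest_path
```

If this raises a timeout or connection error, the problem is more likely the network (a proxy or firewall, for example) than the notebook itself.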

https://raw.githubusercontent.com/ageron/handson-ml2/master/

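I am not sure what kind of link this is. Perhaps someone can explain. If you enter just this link in a browser, the page cannot be accessed. However, it does work for retrieving the data as described in the next paragraph. If you use the actual GitHub link mentioned above, the code cannot extract the data.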
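A likely explanation for the `ReadError` (an assumption, since the replacement URL is not shown) is that a regular GitHub page URL returns an HTML page rather than the archive itself, so `tarfile.open` is handed a file that is not a tarball. A minimal sketch for checking what was actually downloaded before extracting it; `describe_download` is a hypothetical helper name:

```python
import tarfile


def describe_download(path):
    """Report whether the file at `path` looks like a tar archive.

    Hypothetical helper for illustration; not part of the book's code.
    """
    if tarfile.is_tarfile(path):
        return "tar archive"
    # Peek at the first bytes: an HTML page usually starts with a doctype
    # or an <html> tag.
    with open(path, "rb") as f:
        head = f.read(200).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "HTML page"
    return "unknown"
```

If the saved `housing.tgz` is reported as an HTML page, the URL points at a web page and not at the raw file, which would fit the behavior described here.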

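I was able to extract the CSV file from the link using the steps in the book by adding one more line to the imports: "import urllib.request". This seems to work for me in Google Colab. You would think that importing urllib would also import urllib.request, but it does not. I cannot say exactly why it works, but the urllib documentation uses 'import urllib.request' in one of its examples, and I took the idea from there.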
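The underlying reason is that `urllib` is a package, and importing a package does not by itself import its submodules. `urllib.request` only becomes available once something in the process has imported it, which is why `import urllib` alone can work in one environment and fail with an `AttributeError` in another. A sketch of the book's function with the explicit import added; the `with` block is a small change of mine so that the archive is closed even if extraction fails:

```python
import os
import tarfile
import urllib.request  # explicit submodule import; `import urllib` alone is not enough

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then unpack it next to the download.
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
```

Calling `fetch_housing_data()` in a separate cell should then download and unpack the archive under `datasets/housing`.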

import pandas as pd

# Specify the URL of the raw CSV file on GitHub
csv_url = 'https://raw.githubusercontent.com/nghait/gpt/main/AIGPTDAT.csv'

# Read the CSV file from the URL into a pandas DataFrame
df = pd.read_csv(csv_url, sep=';')

# Now you can work with the DataFrame
# For example, you can print the first few rows
print(df.head())