
Google Drive

This page contains the setup guide and reference information for the Google Drive source connector.

info

The Google Drive source connector pulls data from a single folder in Google Drive. Subfolders are included recursively, so all files in the specified folder and all of its subfolders are considered for the sync.

Prerequisites

  • Drive folder link - The link to the Google Drive folder you want to sync files from (includes files located in subfolders)
  • For Airbyte Cloud: A Google Workspace user with access to the Google Drive folder
  • For Airbyte Open Source:
    • A GCP project
    • Enable the Google Drive API in your GCP project
    • A Service Account Key with access to the Google Drive folder you want to replicate

Setup guide

The Google Drive source connector supports authentication via either OAuth or Service Account Key Authentication.

For Airbyte Cloud users, we highly recommend using OAuth, as it significantly simplifies the setup process and allows you to authenticate directly from the Airbyte UI.

For Airbyte Open Source users, we recommend using Service Account Key Authentication. Follow the steps below to create a service account, generate a key, and enable the Google Drive API.

note

If you prefer to use OAuth for authentication with Airbyte Open Source, you can follow Google's OAuth instructions to create an authentication app. Be sure to set the scopes to https://www.googleapis.com/auth/drive.readonly. You will need to obtain your client ID, client secret, and refresh token for the connector setup.
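For illustration, a minimal sketch of obtaining these values with the google-auth-oauthlib package is shown below. It assumes you have downloaded your OAuth app's client credentials to a local client_secret.json file (a hypothetical path); this is just one way to obtain a refresh token, not the only one.

from google_auth_oauthlib.flow import InstalledAppFlow

# Scope required by the connector: read-only access to Drive.
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]

# client_secret.json is the OAuth client file downloaded from the
# Google Cloud console (hypothetical local path).
flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", scopes=SCOPES)
credentials = flow.run_local_server(port=0)  # opens a browser window for consent

# These three values go into the connector's OAuth configuration.
print("client_id:", credentials.client_id)
print("client_secret:", credentials.client_secret)
print("refresh_token:", credentials.refresh_token)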

Set up the service account key (Airbyte Open Source)

Create a service account

  1. Open the Service Accounts page in your Google Cloud console.
  2. Select an existing project, or create a new project.
  3. At the top of the page, click + Create service account.
  4. Enter a name and description for the service account, then click Create and Continue.
  5. Under Service account permissions, select the roles to grant to the service account, then click Continue. We recommend the Viewer role.

Generate a key

  1. Go to the API Console/Credentials page and click on the email address of the service account you just created.
  2. In the Keys tab, click + Add key, then click Create new key.
  3. Select JSON as the Key type. This will generate and download the JSON key file that you'll use for authentication. Click Continue.

Enable the Google Drive API

  1. Go to the API Console/Library page.
  2. Make sure you have selected the correct project from the top.
  3. Find and select the Google Drive API.
  4. Click ENABLE.

If your folder is viewable by anyone with its link, no further action is needed. If not, grant your service account access to the folder by sharing it with the service account's email address (the client_email value in the JSON key file) and giving it at least Viewer access. You can also do this programmatically, as in the sketch below.
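A minimal sketch of sharing the folder via the Google Drive API v3 Python client is shown below. It must run with credentials of an account that can manage the folder; here those are hypothetically loaded from a token.json file, and the folder ID and service account email are placeholders. Sharing through the Drive UI works just as well.

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

FOLDER_ID = "MY-FOLDER-ID"  # hypothetical: the ID at the end of the folder URL
SERVICE_ACCOUNT_EMAIL = "my-sa@my-project.iam.gserviceaccount.com"  # hypothetical

# Credentials of a user who can manage the folder (hypothetical token file).
creds = Credentials.from_authorized_user_file(
    "token.json", scopes=["https://www.googleapis.com/auth/drive"]
)
drive = build("drive", "v3", credentials=creds)

# Add the service account as a read-only viewer of the folder.
drive.permissions().create(
    fileId=FOLDER_ID,
    body={"type": "user", "role": "reader", "emailAddress": SERVICE_ACCOUNT_EMAIL},
    fields="id",
).execute()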

Set up the Google Drive source connector in Airbyte

To set up Google Drive as a source in Airbyte:

  1. Log in to your Airbyte Cloud or Airbyte Open Source account.
  2. In the left navigation bar, click Sources. In the top-right corner, click + New source.
  3. Find and select Google Drive from the list of available sources.
  4. For Source name, enter a name to help you identify this source.
  5. Select your authentication method:

For Airbyte Cloud

  • (Recommended) Select Authenticate via Google (OAuth) from the Authentication dropdown, click Sign in with Google and complete the authentication workflow.

For Airbyte Open Source

  • (Recommended) Select Service Account Key Authentication from the dropdown and enter your Google Cloud service account key in JSON format:

    { "type": "service_account", "project_id": "YOUR_PROJECT_ID", "private_key_id": "YOUR_PRIVATE_KEY", ... }
  • To authenticate your Google account via OAuth, select Authenticate via Google (OAuth) from the dropdown and enter your Google application's client ID, client secret, and refresh token.

  6. For Folder Link, enter the link to the Google Drive folder. To get the link, navigate to the folder you want to sync in the Google Drive UI and copy the current URL.
  7. Configure the optional Start Date parameter that marks a starting date and time in UTC for data replication. Any files that have not been modified since this specified date/time will not be replicated. Use the provided datepicker (recommended) or enter the desired date manually in the format YYYY-MM-DDTHH:mm:ss.SSSSSSZ, for example 2021-01-01T00:00:00.000000Z (a short sketch for producing this format in code follows this list). Leaving this field blank will replicate data from all files that have not been excluded by your streams' glob patterns.
  8. Click Set up source and wait for the tests to complete.
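If you'd rather generate the start date string in code than type it, a minimal sketch using only the Python standard library is shown below (the timestamp itself is just an example):

from datetime import datetime, timezone

# Produces "2021-01-01T00:00:00.000000Z", matching the
# YYYY-MM-DDTHH:mm:ss.SSSSSSZ pattern expected by the start_date field.
start_date = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(start_date.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))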

Supported sync modes

The Google Drive source connector supports the following sync modes:

Feature                                      | Supported?
Full Refresh Sync                            | Yes
Incremental Sync                             | Yes
Replicate Incremental Deletes                | No
Replicate Multiple Files (pattern matching)  | Yes
Replicate Multiple Streams (distinct tables) | Yes
Namespaces                                   | No

Path Patterns

(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)

This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:

  • Referencing many files with just one pattern, e.g. ** would indicate every file in the folder.
  • Referencing future files that don't exist yet (and therefore don't have a specific path).

You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.

Each path pattern is a reference from the root of the folder, so don't include the root folder name itself in the pattern(s).

Some example patterns:

  • ** : match everything.
  • **/*.csv : match all files with a specific extension.
  • myFolder/**/*.csv : match all csv files anywhere under myFolder.
  • */** : match everything at least one folder deep.
  • */*/*/** : match everything at least three folders deep.
  • **/file.*|**/file : match every file called "file" with any extension (or no extension).
  • x/*/y/* : match all files that sit in sub-folder x -> any folder -> folder y.
  • **/prefix*.csv : match all csv files with a specific prefix.
  • **/prefix*.parquet : match all parquet files with a specific prefix.

Let's look at a specific example, matching the following folder layout (MyFolder is the folder specified in the connector config as the root folder, which the patterns are relative to):

MyFolder
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv

We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:

  • We could pick up every csv file called "partX" with the single pattern **/part*.csv.
  • To be a bit more robust, we could use the dual pattern some_table_files/*.csv|more_table_files/*.csv to pick up relevant files only from those exact folders.
  • We could achieve the above in a single pattern by using the pattern *table_files/*.csv. This could however cause problems in the future if new unexpected folders started being created.
  • We can also recursively wildcard, so adding the pattern extras/**/*.csv would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
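Because the connector's glob matching uses wcmatch.glob with the GLOBSTAR and SPLIT flags, you can sanity-check a pattern locally before configuring a stream. A minimal sketch is shown below; the file names are illustrative, mirroring the layout above.

from wcmatch import glob

FLAGS = glob.GLOBSTAR | glob.SPLIT

paths = [
    "some_table_files/part1.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
    "log_files/app.log",
]

# SPLIT lets one pattern string contain several patterns joined by "|".
pattern = "some_table_files/*.csv|more_table_files/*.csv"

for path in paths:
    # Prints True for the first two paths, False for the last two.
    print(path, glob.globmatch(path, pattern, flags=FLAGS))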

User Schema

When using the Avro, JSONL, CSV or Parquet format, you can provide a schema to use for the output stream. Note that this doesn't apply to the experimental Document file type format.

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the folder matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:

  • You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the _ab_additional_properties map.
  • Your initial dataset is quite small (in terms of number of records), and you think the automatic type inference from this sample might not be representative of the data in the future.
  • You want to purposely define types for every column.
  • You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the _ab_additional_properties map.

Or any other reason! The schema must be provided as valid JSON as a map of {"column": "datatype"} where each datatype is one of:

  • string
  • number
  • integer
  • object
  • array
  • boolean
  • null

For example:

  • {"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
  • {"username": "string", "friends": "array", "information": "object"}

File Format Settings

CSV

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.

  • Header Definition: How headers will be defined. User Provided assumes the CSV does not have a header row and uses the headers you provide; Autogenerated assumes the CSV does not have a header row and the CDK will generate headers named f{i}, where i is the column index starting from 0. Otherwise, the default behavior is to use the header row from the CSV file. If you want to autogenerate or provide column names for a CSV that does have a header row, set a value for the "Skip rows before header" option so the existing header row is ignored.
  • Delimiter: Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator. To use tab-delimiters, you can set this value to \t. By default, this value is set to ,.
  • Double Quote: This option determines whether two quotes in a quoted CSV value denote a single quote in the data. Set to True by default.
  • Encoding: Some data may use a different character set (typically when different alphabets are involved). See the list of allowable encodings here. By default, this is set to utf8.
  • Escape Character: An escape character can be used to prefix a reserved character and ensure correct parsing. A commonly used character is the backslash (\). For example, given the following data:
Product,Description,Price
Jeans,"Navy Blue, Bootcut, 34\"",49.99

The backslash (\) is used directly before the second double quote (") to indicate that it is not the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: 34" ).

Leaving this field blank (default option) will disallow escaping.

  • False Values: A set of case-sensitive strings that should be interpreted as false values.
  • Null Values: A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.
  • Quote Character: In some cases, data values may contain instances of reserved characters (like a comma, if that's the delimiter). CSVs can handle this by wrapping a value in defined quote characters so that on read it can parse it correctly. By default, this is set to ".
  • Skip Rows After Header: The number of rows to skip after the header row.
  • Skip Rows Before Header: The number of rows to skip before the header row.
  • Strings Can Be Null: Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.
  • True Values: A set of case-sensitive strings that should be interpreted as true values.
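Mapped onto a stream's configuration, these options live under the stream's format object (field names as in the configuration reference further down this page). A hedged sketch for a tab-delimited export with no header row, autogenerated column names, and 'NA' treated as null; the stream name and glob are hypothetical:

# Sketch of a CSV stream entry exercising several of the options above.
csv_stream = {
    "name": "tab_separated_table",   # hypothetical stream name
    "globs": ["exports/**/*.tsv"],   # hypothetical path pattern
    "format": {
        "filetype": "csv",
        "delimiter": "\t",            # tab-delimited data
        "quote_char": "\"",
        "double_quote": True,
        "encoding": "utf8",
        "null_values": ["NA"],        # 'NA' becomes null
        "strings_can_be_null": True,
        "header_definition": {"header_definition_type": "Autogenerated"},  # file has no header row
    },
}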

Parquet

Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned parquet datasets are unsupported. The following settings are available:

  • Convert Decimal Fields to Floats: Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.

Avro

The Avro parser uses the Fastavro library. The following settings are available:

  • Convert Double Fields to Strings: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision, because there can be a loss of precision when handling floating point numbers.

JSONL

There are currently no options for JSONL parsing.

Document File Type Format (Experimental)

warning

The Document file type format is currently an experimental feature and not subject to SLAs. Use at your own risk.

The Document file type format is a special format that allows you to extract text from Markdown, TXT, PDF, Word, PowerPoint and Google documents. If selected, the connector will extract text from the documents and output it as a single field named content. The document_key field will hold a unique identifier for the processed file, which can be used as a primary key. The content of the document will contain markdown formatting converted from the original file format. Each file matching the defined glob pattern needs to be one of the supported document types, e.g. Markdown (.md), TXT (.txt), PDF (.pdf), Word (.docx), PowerPoint (.pptx) or a Google document.

One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.

Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
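Mapped onto a stream's configuration, the Document file type format is selected with filetype "unstructured" (field names as in the configuration reference further down this page). A hedged sketch; the stream name and glob are hypothetical:

# Sketch of a document-extraction stream: one record per matched file,
# with the extracted markdown in the "content" field and the file
# identifier in "document_key".
document_stream = {
    "name": "contracts",               # hypothetical stream name
    "globs": ["contracts/**/*.pdf"],   # hypothetical path pattern
    "format": {
        "filetype": "unstructured",
        "strategy": "auto",                 # auto / fast / ocr_only / hi_res
        "skip_unprocessable_files": True,   # skip files that fail to parse
        "processing": {"mode": "local"},
    },
}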

Usage with airbyte-lib

For document file type streams, make sure the tesseract and pdftotext libraries are installed. On macOS, you can install them with brew install poppler tesseract.

Install the Python library via: pip install airbyte-lib

Then, execute a sync by loading the connector like this:

import airbyte_lib as ab

# Example configuration (replace the placeholder values with your own).
config = {
  "start_date": "2021-01-01T00:00:00.000000Z",
  "streams": [
    {
      "name": "my_stream",
      "globs": ["**/*.jsonl"],
      "validation_policy": "Emit Record",
      "days_to_sync_if_history_is_full": 3,
      "format": {
        "filetype": "jsonl"
      }
    }
  ],
  "folder_url": "https://drive.google.com/drive/folders/MY-FOLDER-ID",
  "credentials": {
    "auth_type": "Client",
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "refresh_token": "YOUR_REFRESH_TOKEN"
  }
}

result = ab.get_connector(
    "source-google-drive",
    config=config,
).read_all()

# Iterate over the records of the stream defined above.
for record in result.cache.streams["my_stream"]:
  print(record)

You can find more information in the airbyte_lib quickstart guide.

Reference

The full configuration specification for the Google Drive source connector is shown below.

{
  "title": "Config fields",
  "type": "object",
  "properties": {
    "start_date": {
      "title": "Start Date",
      "description": "UTC date and time in the format 2017-01-25T00:00:00.000000Z. Any file modified before this date will not be replicated.",
      "examples": [
        "2021-01-01T00:00:00.000000Z"
      ],
      "format": "date-time",
      "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{6}Z$",
      "pattern_descriptor": "YYYY-MM-DDTHH:mm:ss.SSSSSSZ",
      "order": 1,
      "type": "string"
    },
    "streams": {
      "title": "The list of streams to sync",
      "description": "Each instance of this configuration defines a <a href=\"https://docs.airbyte.com/cloud/core-concepts#stream\">stream</a>. Use this to define which files belong in the stream, their format, and how they should be parsed and validated. When sending data to warehouse destination such as Snowflake or BigQuery, each stream is a separate table.",
      "order": 10,
      "type": "array",
      "items": {
        "title": "FileBasedStreamConfig",
        "type": "object",
        "properties": {
          "name": {
            "title": "Name",
            "description": "The name of the stream.",
            "type": "string"
          },
          "globs": {
            "title": "Globs",
            "description": "The pattern used to specify which files should be selected from the file system. For more information on glob pattern matching look <a href=\"https://en.wikipedia.org/wiki/Glob_(programming)\">here</a>.",
            "default": [
              "**"
            ],
            "order": 1,
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "validation_policy": {
            "title": "Validation Policy",
            "description": "The name of the validation policy that dictates sync behavior when a record does not adhere to the stream schema.",
            "default": "Emit Record",
            "enum": [
              "Emit Record",
              "Skip Record",
              "Wait for Discover"
            ]
          },
          "input_schema": {
            "title": "Input Schema",
            "description": "The schema that will be used to validate records extracted from the file. This will override the stream schema that is auto-detected from incoming files.",
            "type": "string"
          },
          "primary_key": {
            "title": "Primary Key",
            "description": "The column or columns (for a composite key) that serves as the unique identifier of a record. If empty, the primary key will default to the parser's default primary key.",
            "airbyte_hidden": true,
            "type": "string"
          },
          "days_to_sync_if_history_is_full": {
            "title": "Days To Sync If History Is Full",
            "description": "When the state history of the file store is full, syncs will only read files that were last modified in the provided day range.",
            "default": 3,
            "type": "integer"
          },
          "format": {
            "title": "Format",
            "description": "The configuration options that are used to alter how to read incoming files that deviate from the standard formatting.",
            "type": "object",
            "oneOf": [
              {
                "title": "Avro Format",
                "type": "object",
                "properties": {
                  "filetype": {
                    "title": "Filetype",
                    "default": "avro",
                    "const": "avro",
                    "type": "string"
                  },
                  "double_as_string": {
                    "title": "Convert Double Fields to Strings",
                    "description": "Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss precision when handling floating point numbers.",
                    "default": false,
                    "type": "boolean"
                  }
                },
                "required": [
                  "filetype"
                ]
              },
              {
                "title": "CSV Format",
                "type": "object",
                "properties": {
                  "filetype": {
                    "title": "Filetype",
                    "default": "csv",
                    "const": "csv",
                    "type": "string"
                  },
                  "delimiter": {
                    "title": "Delimiter",
                    "description": "The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\\t'.",
                    "default": ",",
                    "type": "string"
                  },
                  "quote_char": {
                    "title": "Quote Character",
                    "description": "The character used for quoting CSV values. To disallow quoting, make this field blank.",
                    "default": "\"",
                    "type": "string"
                  },
                  "escape_char": {
                    "title": "Escape Character",
                    "description": "The character used for escaping special characters. To disallow escaping, leave this field blank.",
                    "type": "string"
                  },
                  "encoding": {
                    "title": "Encoding",
                    "description": "The character encoding of the CSV data. Leave blank to default to <strong>UTF8</strong>. See <a href=\"https://docs.python.org/3/library/codecs.html#standard-encodings\" target=\"_blank\">list of python encodings</a> for allowable options.",
                    "default": "utf8",
                    "type": "string"
                  },
                  "double_quote": {
                    "title": "Double Quote",
                    "description": "Whether two quotes in a quoted CSV value denote a single quote in the data.",
                    "default": true,
                    "type": "boolean"
                  },
                  "null_values": {
                    "title": "Null Values",
                    "description": "A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.",
                    "default": [],
                    "type": "array",
                    "items": {
                      "type": "string"
                    },
                    "uniqueItems": true
                  },
                  "strings_can_be_null": {
                    "title": "Strings Can Be Null",
                    "description": "Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.",
                    "default": true,
                    "type": "boolean"
                  },
                  "skip_rows_before_header": {
                    "title": "Skip Rows Before Header",
                    "description": "The number of rows to skip before the header row. For example, if the header row is on the 3rd row, enter 2 in this field.",
                    "default": 0,
                    "type": "integer"
                  },
                  "skip_rows_after_header": {
                    "title": "Skip Rows After Header",
                    "description": "The number of rows to skip after the header row.",
                    "default": 0,
                    "type": "integer"
                  },
                  "header_definition": {
                    "title": "CSV Header Definition",
                    "description": "How headers will be defined. `User Provided` assumes the CSV does not have a header row and uses the headers provided and `Autogenerated` assumes the CSV does not have a header row and the CDK will generate headers using for `f{i}` where `i` is the index starting from 0. Else, the default behavior is to use the header from the CSV file. If a user wants to autogenerate or provide column names for a CSV having headers, they can skip rows.",
                    "default": {
                      "header_definition_type": "From CSV"
                    },
                    "oneOf": [
                      {
                        "title": "From CSV",
                        "type": "object",
                        "properties": {
                          "header_definition_type": {
                            "title": "Header Definition Type",
                            "default": "From CSV",
                            "const": "From CSV",
                            "type": "string"
                          }
                        },
                        "required": [
                          "header_definition_type"
                        ]
                      },
                      {
                        "title": "Autogenerated",
                        "type": "object",
                        "properties": {
                          "header_definition_type": {
                            "title": "Header Definition Type",
                            "default": "Autogenerated",
                            "const": "Autogenerated",
                            "type": "string"
                          }
                        },
                        "required": [
                          "header_definition_type"
                        ]
                      },
                      {
                        "title": "User Provided",
                        "type": "object",
                        "properties": {
                          "header_definition_type": {
                            "title": "Header Definition Type",
                            "default": "User Provided",
                            "const": "User Provided",
                            "type": "string"
                          },
                          "column_names": {
                            "title": "Column Names",
                            "description": "The column names that will be used while emitting the CSV records",
                            "type": "array",
                            "items": {
                              "type": "string"
                            }
                          }
                        },
                        "required": [
                          "column_names",
                          "header_definition_type"
                        ]
                      }
                    ],
                    "type": "object"
                  },
                  "true_values": {
                    "title": "True Values",
                    "description": "A set of case-sensitive strings that should be interpreted as true values.",
                    "default": [
                      "y",
                      "yes",
                      "t",
                      "true",
                      "on",
                      "1"
                    ],
                    "type": "array",
                    "items": {
                      "type": "string"
                    },
                    "uniqueItems": true
                  },
                  "false_values": {
                    "title": "False Values",
                    "description": "A set of case-sensitive strings that should be interpreted as false values.",
                    "default": [
                      "n",
                      "no",
                      "f",
                      "false",
                      "off",
                      "0"
                    ],
                    "type": "array",
                    "items": {
                      "type": "string"
                    },
                    "uniqueItems": true
                  }
                },
                "required": [
                  "filetype"
                ]
              },
              {
                "title": "Jsonl Format",
                "type": "object",
                "properties": {
                  "filetype": {
                    "title": "Filetype",
                    "default": "jsonl",
                    "const": "jsonl",
                    "type": "string"
                  }
                },
                "required": [
                  "filetype"
                ]
              },
              {
                "title": "Parquet Format",
                "type": "object",
                "properties": {
                  "filetype": {
                    "title": "Filetype",
                    "default": "parquet",
                    "const": "parquet",
                    "type": "string"
                  },
                  "decimal_as_float": {
                    "title": "Convert Decimal Fields to Floats",
                    "description": "Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.",
                    "default": false,
                    "type": "boolean"
                  }
                },
                "required": [
                  "filetype"
                ]
              },
              {
                "title": "Document File Type Format (Experimental)",
                "type": "object",
                "properties": {
                  "filetype": {
                    "title": "Filetype",
                    "default": "unstructured",
                    "const": "unstructured",
                    "type": "string"
                  },
                  "skip_unprocessable_files": {
                    "title": "Skip Unprocessable Files",
                    "description": "If true, skip files that cannot be parsed and pass the error message along as the _ab_source_file_parse_error field. If false, fail the sync.",
                    "default": true,
                    "always_show": true,
                    "type": "boolean"
                  },
                  "strategy": {
                    "title": "Parsing Strategy",
                    "description": "The strategy used to parse documents. `fast` extracts text directly from the document which doesn't work for all files. `ocr_only` is more reliable, but slower. `hi_res` is the most reliable, but requires an API key and a hosted instance of unstructured and can't be used with local mode. See the unstructured.io documentation for more details: https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf",
                    "default": "auto",
                    "always_show": true,
                    "order": 0,
                    "enum": [
                      "auto",
                      "fast",
                      "ocr_only",
                      "hi_res"
                    ],
                    "type": "string"
                  },
                  "processing": {
                    "title": "Processing",
                    "description": "Processing configuration",
                    "default": {
                      "mode": "local"
                    },
                    "type": "object",
                    "oneOf": [
                      {
                        "title": "Local",
                        "type": "object",
                        "properties": {
                          "mode": {
                            "title": "Mode",
                            "default": "local",
                            "const": "local",
                            "enum": [
                              "local"
                            ],
                            "type": "string"
                          }
                        },
                        "description": "Process files locally, supporting `fast` and `ocr` modes. This is the default option.",
                        "required": [
                          "mode"
                        ]
                      }
                    ]
                  }
                },
                "description": "Extract text from document formats (.pdf, .docx, .md, .pptx) and emit as one record per file.",
                "required": [
                  "filetype"
                ]
              }
            ]
          },
          "schemaless": {
            "title": "Schemaless",
            "description": "When enabled, syncs will not validate or structure records against the stream's schema.",
            "default": false,
            "type": "boolean"
          }
        },
        "required": [
          "name",
          "format"
        ]
      }
    },
    "folder_url": {
      "title": "Folder Url",
      "description": "URL for the folder you want to sync. Using individual streams and glob patterns, it's possible to only sync a subset of all files located in the folder.",
      "examples": [
        "https://drive.google.com/drive/folders/1Xaz0vXXXX2enKnNYU5qSt9NS70gvMyYn"
      ],
      "order": 0,
      "pattern": "^https://drive.google.com/.+",
      "pattern_descriptor": "https://drive.google.com/drive/folders/MY-FOLDER-ID",
      "type": "string"
    },
    "credentials": {
      "title": "Authentication",
      "description": "Credentials for connecting to the Google Drive API",
      "type": "object",
      "oneOf": [
        {
          "title": "Authenticate via Google (OAuth)",
          "type": "object",
          "properties": {
            "auth_type": {
              "title": "Auth Type",
              "default": "Client",
              "const": "Client",
              "enum": [
                "Client"
              ],
              "type": "string"
            },
            "client_id": {
              "title": "Client ID",
              "description": "Client ID for the Google Drive API",
              "airbyte_secret": true,
              "type": "string"
            },
            "client_secret": {
              "title": "Client Secret",
              "description": "Client Secret for the Google Drive API",
              "airbyte_secret": true,
              "type": "string"
            },
            "refresh_token": {
              "title": "Refresh Token",
              "description": "Refresh Token for the Google Drive API",
              "airbyte_secret": true,
              "type": "string"
            }
          },
          "required": [
            "client_id",
            "client_secret",
            "refresh_token",
            "auth_type"
          ]
        },
        {
          "title": "Service Account Key Authentication",
          "type": "object",
          "properties": {
            "auth_type": {
              "title": "Auth Type",
              "default": "Service",
              "const": "Service",
              "enum": [
                "Service"
              ],
              "type": "string"
            },
            "service_account_info": {
              "title": "Service Account Information",
              "description": "The JSON key of the service account to use for authorization. Read more <a href=\"https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys\">here</a>.",
              "airbyte_secret": true,
              "type": "string"
            }
          },
          "required": [
            "service_account_info",
            "auth_type"
          ]
        }
      ]
    }
  },
  "required": [
    "streams",
    "folder_url",
    "credentials"
  ]
}

Changelog

Version | Date       | Pull Request | Subject
0.0.6   | 2023-12-16 | 33414        | Prepare for airbyte-lib
0.0.5   | 2023-12-14 | 33411        | Bump CDK version to auto-set primary key for document file streams and support raw txt files
0.0.4   | 2023-12-06 | 33187        | Bump CDK version to hide source-defined primary key
0.0.3   | 2023-11-16 | 31458        | Improve folder id input and update document file type parser
0.0.2   | 2023-11-02 | 31458        | Allow syncs on shared drives
0.0.1   | 2023-11-02 | 31458        | Initial Google Drive source