Google Cloud Storage (GCS)
You can create a Google Cloud Storage Connector to read from your Google Cloud Storage bucket.
To set up this Connector using a GCP Service Account Key, you will need a GCP Service Account that has access to the bucket where the resources reside. To learn more about creating and managing service accounts within GCP, visit: https://cloud.google.com/iam/docs/creating-managing-service-accounts.
The schema for this Source Connector is defined by the newest file in the folder. All files must have the same schema (number and order of columns). Any files not matching the original schema will be ignored.
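The schema rule above can be illustrated with a minimal sketch. It assumes CSV files with a header row and compares column names in order; the helper names `read_header` and `matches_schema` are hypothetical and are not part of Osmos.

```python
import csv
import io


def read_header(text: str, delimiter: str = ",") -> list[str]:
    """Return the first row of a delimited text file as a list of column names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [cell.strip() for cell in next(reader, [])]


def matches_schema(reference: list[str], candidate: list[str]) -> bool:
    """True when both headers have the same number of columns in the same order."""
    return reference == candidate


# The newest file defines the reference schema (hypothetical sample data).
newest = "item,quantity\napple,3\n"
same_schema = "item,quantity\npear,9\n"
reordered = "quantity,item\n9,pear\n"

reference = read_header(newest)
print(matches_schema(reference, read_header(same_schema)))  # True
print(matches_schema(reference, read_header(reordered)))    # False
```

A file like `reordered` would be ignored under the rule described above, because its column order differs from the reference.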
Supported file formats: CSV, XLSX, XLS, TXT (comma separated), and ZIP files containing these files.
- Service Account Key with the proper privileges
- Existing GCS Bucket Name
Step 1: Navigate to the Connectors list page, then click + New Connector
Step 2: Under the System prompt, click GCS
Step 3: Enter a Connector Name
Step 4: Select Source Connector
Authentication is accomplished using Service Account Keys. Provide the Service Account JSON key for the account you wish to connect to.
- Service accounts associated with a GCS Source Connector will need the proper Cloud Storage Privileges in order to successfully establish a connection.
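Before pasting a key into the Connector form, it can help to check that the JSON is well formed. The sketch below is a local sanity check only: it looks for a few fields that GCP service account key files normally contain (`type`, `project_id`, `private_key`, `client_email`) and does not verify that the account has access to the bucket. The function name `check_service_account_key` is hypothetical.

```python
import json

# Fields normally present in a GCP service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> list[str]:
    """Return a list of problems found in the key text; an empty list means it looks plausible."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # Service account keys carry the literal type "service_account".
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems


# Hypothetical, non-functional sample key used only to exercise the check.
sample = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "reader@example-project.iam.gserviceaccount.com",
})
print(check_service_account_key(sample))  # []
```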
Step 1: To find the Bucket Name, first select the Google Cloud Navigation menu, then scroll to Cloud Storage and select Buckets.
Step 2: On the Buckets page, select the name of the bucket you would like to connect to.
Step 3: The Bucket Name can then be copied from the top of the resulting page.
You can choose to process all source files, or filter the files based on the file name. Any files that do not meet the filter criteria will be ignored. Select one of the options:
- 1. Include all files: If this option is chosen, all of the files in the folder will be processed in chronological order.
- 2. Only include files that: If you choose this option, you can filter which files to process from the source folder based on three options:
- File names starting with,
- File names containing, or
- File names ending with.
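The filter options above behave like simple string tests on the file name. The sketch below is an illustration, not the Connector's implementation: the mode names are hypothetical, and case-sensitive matching is an assumption.

```python
def include_file(name: str, mode: str = "all", pattern: str = "") -> bool:
    """Decide whether a file name passes the filter (case-sensitive, by assumption)."""
    if mode == "all":
        return True
    if mode == "starts_with":
        return name.startswith(pattern)
    if mode == "contains":
        return pattern in name
    if mode == "ends_with":
        return name.endswith(pattern)
    raise ValueError(f"unknown filter mode: {mode}")


# Hypothetical file names in the source folder.
files = ["orders_2024.csv", "orders_2023.csv", "returns_2024.csv"]
print([f for f in files if include_file(f, "starts_with", "orders")])
# ['orders_2024.csv', 'orders_2023.csv']
```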
Within the source folder, either all files must contain column header names or none of them may. Select one of the options:
- 1. All source files contain headers: If this option is selected, we will use the first row as column header names to label the schema within Osmos. Rows two and up will be read as data records.
- 2. No source files contain headers: If this option is selected, we autogenerate column names for the schema within Osmos. All rows, including the first row, will be read as data records.
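The difference between the two header options can be sketched as follows. The autogenerated names used here (`column_1`, `column_2`, ...) are placeholders chosen for illustration; the names Osmos actually generates are not specified in this document.

```python
import csv
import io


def read_records(text: str, has_headers: bool):
    """Split delimited text into (column_names, data_rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    if has_headers:
        # First row labels the schema; remaining rows are data.
        return rows[0], rows[1:]
    # No headers: generate placeholder names and treat every row as data.
    columns = [f"column_{i}" for i in range(1, len(rows[0]) + 1)]
    return columns, rows


text = "item,quantity\napple,3\n"
print(read_records(text, has_headers=True))
# (['item', 'quantity'], [['apple', '3']])
print(read_records(text, has_headers=False))
# (['column_1', 'column_2'], [['item', 'quantity'], ['apple', '3']])
```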
The delimiter to use when reading files. Select a delimiter from the dropdown list.
There are then two options for how the selected delimiter is applied:
- Selected delimiter applies to .TXT file only...: By default, the delimiter selected from the dropdown list applies only to .txt files. .csv (comma-separated values) and .tsv (tab-separated values) files continue to be processed according to their file extension.
- Selected delimiter applies to all files in the folder...: Select this option when file extensions should be ignored and the delimiter selected from the dropdown list should be the only delimiter used for all files processed by the connector.
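One way to picture the two delimiter modes is a lookup keyed on the file extension. This is a sketch under the assumption that .csv maps to a comma and .tsv to a tab; the function name `delimiter_for` and the `apply_to_all` flag are hypothetical.

```python
from pathlib import PurePosixPath

# Delimiters implied by the file extension.
EXTENSION_DELIMITERS = {".csv": ",", ".tsv": "\t"}


def delimiter_for(file_name: str, selected: str, apply_to_all: bool = False):
    """Pick the delimiter for a file given the dropdown selection and the chosen mode."""
    if apply_to_all:
        # Extension is ignored; the selected delimiter is used for every file.
        return selected
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix == ".txt":
        return selected
    # Returns None for extensions that are not delimited text (for example .xlsx).
    return EXTENSION_DELIMITERS.get(suffix)


print(repr(delimiter_for("data.txt", "|")))                     # '|'
print(repr(delimiter_for("data.csv", "|")))                     # ','
print(repr(delimiter_for("data.csv", "|", apply_to_all=True)))  # '|'
```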
The source file may contain invalid characters. You can choose to keep all characters from the source or to strip the null characters. Select one of the options:
- 1. Keep all characters from source: If this option is selected, we will retain all characters from the source file, replacing characters we cannot decode with the Unicode undefined character.
- 2. Strip null characters: If this option is selected, we filter out all null characters (characters with a value of 0). This is useful when dealing with null-terminated strings.
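The two character-handling options can be approximated in a few lines. This sketch assumes UTF-8 input and uses Python's `errors="replace"` handler, which substitutes U+FFFD for undecodable bytes; whether Osmos uses the same substitute character, and whether it also replaces undecodable bytes when stripping nulls, are assumptions.

```python
def decode_source(raw: bytes, strip_nulls: bool = False) -> str:
    """Decode raw file bytes, optionally removing null characters."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    if strip_nulls:
        # Drop characters whose value is 0, e.g. from null-terminated strings.
        text = text.replace("\x00", "")
    return text


raw = b"apple\x00,3\xff"
print(repr(decode_source(raw)))                    # 'apple\x00,3\ufffd'
print(repr(decode_source(raw, strip_nulls=True)))  # 'apple,3\ufffd'
```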
We support three deduplication methods, which operate at the file level or the record level. Select one of the following options:
- 1. File level deduplication: If this option is selected, deduplication will be performed at a file level only. If a file name is changed, or the file itself is changed, the entire file will be processed in subsequent runs.
- 2. Record level deduplication across all historical data: When this is selected, in addition to file level deduplication, deduplication will be performed at a record level across all the files processed by this Pipeline. An identical record that was already processed in a previous Osmos Pipeline run will not be processed in the current file, nor will duplicated records within the same file.
  Example:
  file_a.csv:
    item, quantity
    apple, 3
    orange, 9
    banana, 2
  file_b.csv:
    item, quantity
    pear, 9
    apple, 3
    banana, 2
  After processing file_a.csv, if we add file_b.csv to the same directory and run a job, only the row containing pear, 9 will be processed, as apple, 3 and banana, 2 were already seen when file_a.csv was processed. The same applies within a single file. If we had added those records to file_a.csv instead of creating file_b.csv, the net result would be the same: pear, 9 would be the only new row.
- 3. Record level deduplication within individual files: When this is selected, in addition to file level deduplication, deduplication will be performed at a record level, but only within the same file. If the file being processed has the same record appearing multiple times, the record will be processed only once.
  Example:
  file_a.csv:
    item, quantity
    apple, 3
    orange, 9
    banana, 2
  file_b.csv:
    item, quantity
    pear, 9
    apple, 3
    banana, 2
  After processing file_a.csv, if we add file_b.csv to the same directory and run a job, all three records in file_b.csv will be processed. If instead we had added those records to file_a.csv, the duplicated records (apple, 3 and banana, 2) would be skipped, and pear, 9 would be the only new record processed.
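The difference between the two record-level modes comes down to how long the set of already-seen records lives. The sketch below reuses the file_a.csv and file_b.csv records from the examples; the helper `new_records` is hypothetical and only models the record-level part, not file level deduplication.

```python
def new_records(rows, seen):
    """Return rows not yet in `seen`, adding each new row to `seen` as it is found."""
    fresh = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            fresh.append(row)
    return fresh


file_a = [("apple", "3"), ("orange", "9"), ("banana", "2")]
file_b = [("pear", "9"), ("apple", "3"), ("banana", "2")]

# Across all historical data: one `seen` set is shared by every run.
history = set()
new_records(file_a, history)         # first run processes all of file_a
print(new_records(file_b, history))  # [('pear', '9')]

# Within individual files: each file starts with an empty `seen` set.
print(new_records(file_b, set()))
# [('pear', '9'), ('apple', '3'), ('banana', '2')]
```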
We support a Starting Cell offset for spreadsheet-type data (for example, XLSX and XLS files) in order to crop unnecessary information out of a dataset and to ensure headers are correctly mapped.
The coordinates provided serve as the starting location from which the data will be read. By default, reading begins at coordinates (1,1), which results in a read of all the data in the document. With a configuration of Row 2, Column 2, for example, the first row and the first column are omitted and reading starts at the cell in the second row and second column.
Note that even with no Starting Cell offset in place (i.e. a Row 1, Column 1 configuration), reading begins at the first row that contains data; any leading rows containing no data are omitted.
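The Starting Cell behavior can be sketched on a sheet represented as a list of rows. This is an illustration only: the function names are hypothetical, coordinates are treated as 1-based, and skipping leading empty rows after the crop is an assumption drawn from the note above.

```python
def has_data(row) -> bool:
    """True when at least one cell in the row is non-empty."""
    return any(cell not in (None, "") for cell in row)


def read_from_start_cell(rows, start_row: int = 1, start_col: int = 1):
    """Crop rows and columns before the 1-based starting cell, then skip leading empty rows."""
    cropped = [row[start_col - 1:] for row in rows[start_row - 1:]]
    first = 0
    while first < len(cropped) and not has_data(cropped[first]):
        first += 1
    return cropped[first:]


# Hypothetical sheet with a title row and an empty first column.
sheet = [
    ["Monthly report", "", ""],
    ["", "item", "quantity"],
    ["", "apple", "3"],
]
print(read_from_start_cell(sheet, start_row=2, start_col=2))
# [['item', 'quantity'], ['apple', '3']]
```

With the Row 2, Column 2 configuration, the first remaining row (`item`, `quantity`) is the one that would be mapped as headers.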
By default, this connector will read the first sheet of a workbook as its data source. We also support the designation of specific sheets within a workbook to be read. Sheets designated here will be read exclusively, allowing the connector to skip non-relevant sheets, and to read multiple sheets from a single workbook.
We support the use of a parser webhook for the purpose of pre-processing data. This field allows for the designation of a webhook URL. The webhook protocol must also be designated here. Currently, only gRPC webhooks are supported.