What Type of Variables Can Be Used as Output for the Read Csv Activity?

Read CSV (RapidMiner Studio Core)

Synopsis

This Operator reads an ExampleSet from the specified CSV file.

Description

CSV is an abbreviation for Comma-Separated Values. CSV files store data (both numerical and text) in plain-text form. All values belonging to an Example are stored as one line in the CSV file. Values for different Attributes are separated by a separator character. The separator remains constant: each row in the file uses the same separator for separating Attribute values. The term 'CSV' suggests that the Attribute values are separated by commas, but other separators can also be used.
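
The idea of one constant separator per file can be illustrated outside RapidMiner. The following is a minimal Python sketch using the standard csv module. It is not how the Read CSV Operator is implemented, and the helper name read_rows and the sample data are assumptions made for illustration only.

```python
import csv
import io


def read_rows(text, separator=","):
    """Split every non-empty line of `text` into values using one constant separator."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    return [row for row in reader if row]


# The same two lines, once comma-separated and once semicolon-separated.
comma_rows = read_rows("id,name\n1,alice\n")
semicolon_rows = read_rows("id;name\n1;alice\n", separator=";")
```

Both calls yield the same rows. Only the separator passed to the reader differs, which mirrors the role of the column separators parameter described below.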

The easiest way to import a CSV file is to use the Import Configuration Wizard from the Parameters panel. All parameters can also be set directly in the Parameters panel. For more details about the Operator, see the description of the parameters.

Please make sure that the CSV file is read correctly as an ExampleSet before building a Process that uses it.

Differentiation

There are many Read <source> Operators in the Data Access group and Files/Read sub-group. For instance, there are Read Excel, Read URL, Read SPSS, Read XML and other Operators, which can read an ExampleSet from different file formats.

Input

  • file (File)

    A CSV file can optionally be passed in as a file object. This can be created with Operators that have file output ports, such as the Read File Operator.

Output

  • output (Data Table)

    This port delivers the ExampleSet created from the CSV file provided at the input port, imported through the Import Configuration Wizard or loaded from the path given to the csv file parameter.

Parameters

  • Import_Configuration_Wizard

    This convenient wizard guides you to easily configure this Operator to import the CSV file.

    Range:
  • csv_file

    The path of the CSV file is specified here. It can also be selected using the 'Choose a file' button.

    Range:
  • column_separators

    Column separators for CSV files can be specified here. They can also be provided as a regular expression. A good understanding of regular expressions can be developed by studying the description of the Select Attributes Operator and its tutorial Processes.

    Range:
  • trim_lines

    This parameter indicates if lines should be trimmed (removal of empty spaces at the beginning and the end) before the column split is performed. This option might be problematic if TABs ('\t') are used as separators.

    Range:
  • use_quotes

    This parameter indicates if quotes should be regarded. Quotes can be used to store special characters like column separators. For example, if (,) is set as column separator and (") is set as quotes character, then a row (a,b,c,d) will be translated as 4 values for 4 columns. On the other hand, ("a,b,c,d") will be translated as a single column value a,b,c,d. If this parameter is set to false, the quotes character parameter and the escape character parameter cannot be defined.

    Range:
  • quotes_character

    This parameter defines the quotes character and is only available if use quotes is set to true.

    Range:
  • escape_character

    This parameter specifies the character used to escape the quotes and is only available if use quotes is set to true. For example, if (") is used as quotes character and ('\') is used as escape character, then ("yes") will be translated as (yes) and (\"yes\") will be translated as ("yes").

    Range:
  • skip_comments

    This parameter is used to ignore comments in the CSV file (if any). If this option is set to true, a comment character should be defined using the comment characters parameter.

    Range:
  • comment_characters

    This parameter is available if skip comments is set to true. Lines beginning with these characters are ignored. If this character is present in the middle of the line, anything that comes in that line after this character is ignored. The comment character itself is also ignored.

    Range:
  • parse_numbers

    This parameter specifies whether numbers are parsed or not.

    Range:
  • decimal_character

    This character is used as the decimal character.

    Range:
  • grouped_digits

    This parameter decides whether grouped digits should be parsed or not. If this parameter is set to true, the grouping character parameter has to be specified.

    Range:
  • grouping_character

    This character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example, if "22-14" is present in the CSV file and "-" is set as the grouping character, then "2214" will be stored.

    Range:
  • infinity_string

    This parameter can be set to parse a specific infinity representation (e.g. "Infinity"). If it is not set, the locale-specific infinity representation will be used.

    Range: string
  • date_format

    This parameter specifies the date and time format. Many predefined options exist, but users can also specify a new format. If text in a CSV file column matches this date format, that column is automatically converted to date type.

    Some corrections are automatically made on invalid date values. For example, a value '32-March' will automatically be converted to '1-April'.

    Columns containing values which cannot be interpreted as numbers will be interpreted as nominal, as long as they do not match the date and time pattern of the date format parameter. If they do match, this column of the CSV file will be automatically parsed as date and the corresponding Attribute will be of type date.

    Range:
  • first_row_as_names

    If this parameter is set to true, it is assumed that the first line of the CSV file has the names of the Attributes. If so, the Attributes are automatically named and the first line of the CSV file is not treated as a data line.

    Range:
  • annotations

    If the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any rows, you can annotate them as Comment. Remember that the row number in this menu does not count commented lines.

    Range:
  • time_zone

    Users can select any time zone from the list of provided time zones.

    Range:
  • locale

    Users can select any locale from the list of provided locales.

    Range:
  • encoding

    Users can select any encoding from the list of provided encodings.

    Range:
  • read_all_values_as_polynominal

    This option allows you to disable the type handling for this Operator. Every column will be read as a polynominal Attribute.

    Range:
  • data_set_meta_data_information

    This parameter allows you to adjust or override the meta data of the CSV file. Column index, name, type and role can be specified here.

    The Read CSV Operator automatically tries to determine an appropriate data type of the Attributes by reading the first few lines and checking the occurring values. Integer values are assigned the integer data type, real values the real data type. Values which cannot be interpreted as numbers are assigned the nominal data type, as long as they do not match the format of the date format parameter.

    With the data set meta data information parameter, this automatic assignment can be adjusted or overwritten.

    Range:
  • read_not_matching_values_as_missings

    If this parameter is set to true, values that do not match the expected value type are considered as missing values and are replaced by '?'. For example, if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the CSV file is also read as a missing value.

    Range:
  • data_management

    This parameter determines how the data is represented internally. Users can select any option from the provided list.

    Range:
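
The behaviour of several of the parameters above (use quotes, quotes character, escape character, comment characters, grouping character and the automatic type guessing) can be approximated in a few lines of Python. The sketch below relies on the standard csv module and is only an illustration of the documented behaviour, not RapidMiner's implementation. All function names are hypothetical, the comment handling is deliberately naive (it does not respect quotes), and the type guess ignores date formats.

```python
import csv


def strip_comment(line, comment_char="#"):
    """Drop the comment character and everything after it (naive: ignores quoting)."""
    return line.split(comment_char, 1)[0]


def split_values(line, separator=",", quote='"', escape="\\"):
    """Split one line into values, honouring a quotes character and an escape character."""
    reader = csv.reader([line], delimiter=separator, quotechar=quote, escapechar=escape)
    return next(reader)


def parse_grouped_number(text, grouping_char="-", decimal_char="."):
    """Remove the grouping character, normalise the decimal character, then parse."""
    cleaned = text.replace(grouping_char, "").replace(decimal_char, ".")
    return float(cleaned)


def _all_parse(values, caster):
    """Return True if `caster` accepts every value without raising ValueError."""
    try:
        for value in values:
            caster(value)
    except ValueError:
        return False
    return True


def guess_type(values):
    """Guess 'integer', 'real' or 'nominal' from sample values (date formats not considered)."""
    if _all_parse(values, int):
        return "integer"
    if _all_parse(values, float):
        return "real"
    return "nominal"


unquoted = split_values('a,b,c,d')        # four separate values
quoted = split_values('"a,b,c,d"')        # one value containing the separators
escaped = split_values('\\"yes\\"')       # escaped quotes stay part of the value
grouped = parse_grouped_number("22-14")   # grouping character is dropped
```

Splitting on the comment character before parsing quotes keeps the sketch short, but it would truncate a quoted value that contains the comment character. A fuller version would have to track quoting first.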

Tutorial Processes

Read a CSV file

(Optional) Save the following text in a text file:

att1,att2,att3,att4 # row 1

80.6, yes , 1996.JAN.21 ,22-14 # row 2

12.43,"yes",1997.MAR.30,23-22 # row 3

13.5,\"no\",1998.AUG.22,23-14 # row 4

23.3,yes,1876.JAN.32,42-65# row 5

21.6,yes,2001.JUL.12,xyz # row 6

12.56,",_?",2002.SEP.18,15-90# row 7

This is a sample CSV file.

(Optional) You can load this with the given tutorial Process by providing its path in the csv file parameter or by using the 'Choose a file' button.

Run the Process and compare the results in the Results view with the CSV file. The Process performs the following actions:

'#' is defined as a comment character so 'row {number}' is ignored in all rows. As the first row as names parameter is set to true, att1, att2, att3 and att4 are set as Attribute names. The Attribute att1 is set as real, att2 as polynominal, att3 as date and att4 as real. For Attribute att4, the '-' character is ignored in all rows because the grouped digits parameter is set to true and '-' is specified as the grouping character. In row 2, the white spaces at the start and end of values are ignored because the trim lines parameter is set to true. In row 3, quotes are not ignored because use quotes is set to true; the content inside the quotes is taken as the value for Attribute att2. In row 4, (\"no\") is taken as (no) in quotes, because the escape character is set to '\'. In row 5, the date value is automatically corrected from 'JAN.32' to 'FEB.1'. In row 6, an invalid real value for the Attribute att4 is replaced by '?' because the read not matching values as missings parameter is set to true. In row 7, quotes are used to retrieve special characters as values, including the column separator (,) and a question mark.
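
For readers who want to check the expected results without RapidMiner, the actions described above can be imitated in Python. This is a rough sketch under stated assumptions and not the Operator's actual logic. The names SAMPLE, lenient_date, to_number and read_sample are hypothetical, every value is trimmed individually rather than trimming whole lines, comments are cut without regard to quotes, and a missing value is represented as None instead of '?'.

```python
import csv
from datetime import date, timedelta

# The sample file from this tutorial (raw string so the backslashes stay literal).
SAMPLE = r'''att1,att2,att3,att4 # row 1
80.6, yes , 1996.JAN.21 ,22-14 # row 2
12.43,"yes",1997.MAR.30,23-22 # row 3
13.5,\"no\",1998.AUG.22,23-14 # row 4
23.3,yes,1876.JAN.32,42-65# row 5
21.6,yes,2001.JUL.12,xyz # row 6
12.56,",_?",2002.SEP.18,15-90# row 7
'''

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def lenient_date(text):
    """Parse 'yyyy.MMM.dd', rolling an out-of-range day over into the next month."""
    year, month, day = text.split(".")
    first_of_month = date(int(year), MONTHS.index(month.upper()) + 1, 1)
    return first_of_month + timedelta(days=int(day) - 1)


def to_number(text, grouping_char=""):
    """Parse a real number, dropping the grouping character; None marks a missing value."""
    cleaned = text.replace(grouping_char, "") if grouping_char else text
    try:
        return float(cleaned)
    except ValueError:
        return None


def read_sample(text):
    """Return (attribute names, list of Examples as dicts) for the tutorial file."""
    rows = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # skip comments, trim the line
        if not line:
            continue
        values = next(csv.reader([line], delimiter=",", quotechar='"', escapechar="\\"))
        rows.append([value.strip() for value in values])
    names, data = rows[0], rows[1:]  # first row as names
    examples = []
    for att1, att2, att3, att4 in data:
        examples.append({
            names[0]: to_number(att1),
            names[1]: att2,
            names[2]: lenient_date(att3),
            names[3]: to_number(att4, grouping_char="-"),
        })
    return names, examples


names, examples = read_sample(SAMPLE)
```

In this sketch the fourth data line (row 5 of the file) yields the date 1876-02-01, and the fifth (row 6) yields None for att4, which correspond to the corrected date and the '?' shown in the Results view.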

Source: https://docs.rapidminer.com/latest/studio/operators/data_access/files/read/read_csv.html